Google 



This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project 

to make the world's books discoverable online. 

It has survived long enough for the copyright to expire and the book to enter the public domain. A public domain book is one that was never subject 

to copyright or whose legal copyright term has expired. Whether a book is in the public domain may vary country to country. Public domain books 

are our gateways to the past, representing a wealth of history, culture and knowledge that's often difficult to discover. 

Marks, notations and other maiginalia present in the original volume will appear in this file - a reminder of this book's long journey from the 

publisher to a library and finally to you. 

Usage guidelines 

Google is proud to partner with libraries to digitize public domain materials and make them widely accessible. Public domain books belong to the 
public and we are merely their custodians. Nevertheless, this work is expensive, so in order to keep providing tliis resource, we liave taken steps to 
prevent abuse by commercial parties, including placing technical restrictions on automated querying. 
We also ask that you: 

+ Make non-commercial use of the files We designed Google Book Search for use by individuals, and we request that you use these files for 
personal, non-commercial purposes. 

+ Refrain fivm automated querying Do not send automated queries of any sort to Google's system: If you are conducting research on machine 
translation, optical character recognition or other areas where access to a large amount of text is helpful, please contact us. We encourage the 
use of public domain materials for these purposes and may be able to help. 

+ Maintain attributionTht GoogXt "watermark" you see on each file is essential for in forming people about this project and helping them find 
additional materials through Google Book Search. Please do not remove it. 

+ Keep it legal Whatever your use, remember that you are responsible for ensuring that what you are doing is legal. Do not assume that just 
because we believe a book is in the public domain for users in the United States, that the work is also in the public domain for users in other 
countries. Whether a book is still in copyright varies from country to country, and we can't offer guidance on whether any specific use of 
any specific book is allowed. Please do not assume that a book's appearance in Google Book Search means it can be used in any manner 
anywhere in the world. Copyright infringement liabili^ can be quite severe. 

About Google Book Search 

Google's mission is to organize the world's information and to make it universally accessible and useful. Google Book Search helps readers 
discover the world's books while helping authors and publishers reach new audiences. You can search through the full text of this book on the web 

at |http: //books .google .com/I 



HARVARD UNIVERSITY 




LIBRARY OF THE 

GRADUATE SCHOOL 
OF EDUCATION 




THE ELEMENTS OF 
STATISTICAL METHOD 



THE MACMILLAN COMPANY 

NEW YORK BOSTON CHICAGO 

DALLAS SAN FRANCISCO 

MACMILLAN & CO., Limited 

LONDON BOMBAY CALCUTTA 
MELBOURNE 

THE MACMILLAN CO. OF CANADA, Ltd. 

TORONTO 



THE ELEMENTS OF 

STATISTICAL METHOD 

4^ 



BT 



WILLFORD I. KING, M.A. 

INSTRUCTOR IN STATISTICS IN THe"uNIVERSITY OF WISCONSIN 



Neto Yotit 
THE MACMILLAN COMPANY 

LONDON: MACMILLAN & CO., Ltd. 

1914 



^^^ Depfc tf EdveitM Library. 



/7; 




COPYKIGHT, 1912, 

By the MACMILLAN COMPANY. 



Set up and electrotyped. Published April, 19x2. Reprinted 
January, 1913 ; July, 1914. 



yortDoolr $Tr00 : 
Berwick & Smith Co., Norwood, Mass., U.S.A. 






DEDICATED 
TO 

THOMAS SEWALL ADAMS, Ph.D. 

• OF THE 

WISCONSIN TAX COMMISSION 



PREFACE. 

The purpose of this book is to furnish a simple text 
in statistical method for the benefit of those students, 
economists, administrative officials, writers, or other 
members of the educated public who desire a general 
knowledge of the more elementary processes involved 
in the scientific study, analysis, and use of large masses 
of numerical data. While it is intended primarily for 
the use of those interested in sociology, political econ- 
omy, or administration, the general principles set forth 
are applicable likewise to every variety of statistical 
data. The author has found that the members of his 
classes in this subject were not, as a rule, expert mathe- 
maticians, and he believes that this is true of a majority 
of those persons who are called upon to make practical 
use of statistics, hence, no pretense whatever has been 
made, in this work, of presenting any but the most 
simple of .the mathematical theorems upon which 
statistical method is based. 

So far as the author is aware, there is no book pub- 
lished in America, which attempts to cover the field of 
statistical method in its present state of advancement. 
There are several excellent treatises published abroad 
but they either embrace but a part of the subject or 
are adapted especially to the biologist, to the advanced 

Vll 



viii PREFACE. 

student of statistics, or to those having considerable 
mathematical training. Under these circimistances, it 
is believed that there is place for an elementary text 
of this nature. 

References are given to only a few of the principal 
works on the subject. It is not expected that the 
student will read all of those listed at the close of any 
chapter, but, when all are available, it will usually be 
best to make use of them in the order in which they 
are named. If more advanced study of any topic is 
desired, the student will find abundant references cited 
in those books. 

My thanks are due to Dr. Horace Secrist, Professor 
John R. Commons, and Professor T. K. Urdahl, all of 
the University of Wisconsin, for reading my manuscript 
and ofifering me many valuable suggestions which have 
resulted in its improvement. 

To Dr. Thos. S. Adams, my former instructor, now 

of the Wisconsin Tax Commission, I am indebted for 

the major part of all that has made this work possible, 

and any merit which it may possess must be credited 

largely to his efforts. 

WiLLFORD I. King. 

University op Wisconsin, 
September, 1911. 
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PART I. 

INTRODUCTION. 



CHAPTER I. 



THE HISTORICAL DEVELOPMENT OF 
STATISTICAL SCIENCE. 

Sec. 1. Preliminary Remarks. 

During the last fifty years, the proper system of 
dealing with large numbers, in other words the science 
of statistics, has first attained importance in the public 
mind and has first been perfected to such an extent as 
to really have attained the dignity of a science. Never- 
theless, this stage is but the latest story to be added to 
a building whose foundations were laid many centuries 
since, and, while the main purpose of this book is to 
deal with the methods of procedure used in the existing 
stage of advancement, it seems that the present picture 
may stand out more clearly if placed in its proper 
setting by means of briefly outlining the development 
of statistics from its early beginnings down to the 
present day. 

Sec. 2. Statistics in Ancient Times. 

The growth of statistics is coeval with the growth of 
national organization. As soon as tribes were united 
into coherent confederacies or distinct nations, it be- 
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* 

came necessary for the ruler to collect facts concerning 
his domain. He must needs know something of its 
wealth in order to compute the amount of taxes or 
tribute which he might levy. He must have a census 
of his fighting men so that he might know his war 
strength. On certain occasions, unusual situations 
might arise which would necessitate enumerations for 
other specific purposes. One of our eariiest statistical 
compilations tells us of the collection of data concerning 
the population and wealth of Egypt in order to make 
arrangements for the construction of the pyramids. 
This occurred about 3050 B.C.^ Many centuries later 
(about 1400 B.C.), Rameses II took a census* of all 
the lands of Egypt in order to reapportion them among 
his subjects in such a way as he deemed proper. 

In the first and second chapters of the Book of 
Numbers, we read how Moses numbered the tribes of 
Israel doubtless with a view of determining their 
fighting strength. Another census was taken by David, 
about 1018 B.C., for similar purposes.^ Even in the 
Far East, similar forces were at work and it is re- 
corded that the Chinese government had a description 
of the provinces compiled by Yuking as early as 1200 
B.C. 

Herodotus tells us that "Lykurgus divided the terri- 
tory of Laconia into 39,000 portions, assigning to the 

1 Herodotus II, 109. 

2 Herodotus II, 177. 
» II Samuel XXIV. 
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Spartans 9,000 portions and to the Lacedaemomans 
30,000 portions."! 

This was but one of many enumerations made by 
the ancient Greeks for the purposes of apportioning 
land, levying taxes, classifying the inhabitants and 
determining the military strength. In Rome, after the 
days of Servius TuUius, quite elaborate censuses were 
taken, primarily for the purposes of taxation and as- 
certainment of population, and the inhabitants of the 
city were required to register births and deaths at 
certain specified temples. 

During the Middle Ages, the feudal barons and their 
imperial over-lords frequently enumerated the popula- 
tion and property of their domains, investigations of 
this nature by Charlemagne, William the Conqueror, 
Al-Mamum, Emperor Frederick II of Germany, and 
Edward II of England being among those recorded by 
historians of the day. 

It should be noted that, in each of the above instances, 
the intent of the census was to aid the government in 
its administrative work or its plans for war; taxation, 
land distribution, and available soldiers being the com- 
monest subjects of inquiry calling forth an enumeration. 
With the exception of the Roman censuses, such in- 
quiries seem to have been undertaken only when some 
special reason existed for collecting the data and not 
at any regular intervals. 

1 Meitzen, A., Siatistica, p. 16. 
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Sec. 3. Mercantilistic Period. 

During the period in which Mercantilism dominated 
the policies of the Western European governments, a 
marked increase in the bulk of statistics collected is 
noticeable. This was the natural outcome of the beUef 
that the government should encourage certain Hnes of 
industry and should adopt all necessary measures to 
secure a favorable balance of trade. In order to judge 
correctly of needs for and effects of the various kinds of 
legislation, more elaborate statistics were necessary than 
had hitherto been considered essential. Besides, the 
growth of centralized monarchy, with the accompanying 
elaboration of government, gave a greater necessity 
for extensive statistical information than had been 
required during the Middle Ages, and T^dth this greater 
necessity, came also the increase of ability required 
to successfully carry out such investigations as might 
be desired. Success now attended that monarch who 
could, in advance, best measure his resources as com- 
pared with those of his rivals, and then best husband 
these resources for times of conflict. 

We find Philip II of Spain making extensive in- 
quiries in 1575 A.D., from the prelates and corregidors 
of Spain concerning the districts over which they had 
supervision. At the beginning of the seventeenth 
century, Sully prepared for his master, Henry of 
Navarre, a comprehensive statement of the financial 
and military resources of France; in 1665, Colbert com- 
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piled extensive statistics of trade ;^ and, in 1C99, Louis 
XIV required reports on the state of the country from 
each of the general intendants. 

It was Prussia, however, which, in modem times,^ 
first began a systematic, periodic collection of statistical 
data. In 1719, Frederick William I began gathering 
semi-annual reports as to population, occupations, 
houses, real estate holdings, taxes, city finances, etc. 
At a later date, these figures were collected and tabulated 
at intervals of only three years. Frederick the Great, 
likewise, was a firm believer in the value of statistical 
information and he enlarged the scope of the inquiries 
by including such things as nationality, age, deaths and 
their causes, data concerning agriculture, trade, manu- 
factures, shipping, value of property, etc. He even 
took a decided personal interest in the work and, as 
a result, the accuracy and comjjleteness of the informa- 
tion was immensely improved. During the period 1747 
to 1782, a complete statistical system was thus worked 
out. 

Sec. 4. The Modem Census. 

The idea of the decennial census seems to be an 
American product. The provision for representation 
in the lower house of Congress in accordance with 
population made a census indispensable, and hence, this 

^ Block, M., Traite de Statistiquef p. 25, also Bertillon, Cours 
EUmentaire de Statistiquej p. 27. 
* Meitzen, A., StatisticSj p. 27. 
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was provided for in the Constitution and the first census 
was taken in 1790. Slightly over a decade later, (in 
1801), England, too, adopted a similar plan of enumera- 
tion. 

The German ZoUverein of 1833, which eliminated 
interstate duties within the German boundaries and 
preserved only a connnon external customs barrier, 
provided for a distribution of the proceeds of the tariiBf 
according to population.^ In order to secure a correct 
apportionment, a triennial census was established. The 
idea of a regular enumeration gained steadily in favor 
and was adopted by one after another of the civilized 
nations. Finally, in 1911, we find China taking her 
first official census. 

As time has passed, the censuses have grown larger 
and larger in scope, and during the last three or four 
decades, have become extremely elaborate. In 1900, 
the United States established a permanent census 
bureau which devotes itself continuously to the working 
out of many special problems and the elucidation of the 
statistics collected during the regular census periods. 

Most leading nations also have special statistical 
bureaus which, by means of scientific estimates, attempt 
to keep the statistics of a nation abreast of the tin^s. 
An example of this idea in the United States is our 
national Bureau of Statistics. Many of the states have 
also provided similar bureaus for their respective needs. 

^ Bertillon, J., Cours EUmentaire de Statistigue, p. 23. 
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Sec. 5. Comparative Statistics. 

Today we have great masses of statistics collected by 
numerous sources, public and private, but with little 
co-ordination of work as regards different bureaus in 
the same nation. As a result, while each collection is 
valuable for purposes of its own, comparison of data 
for different places is still very difficult. In the earlier 
statistical inquiries, this was probably not even thought 
of but, with the appearance of national rivalry among 
the leading European nations and the general adoption 
of a Mercantilistic policy, such comparisons began to 
be made. 

As eariy as 1544, Sebastian Muenster, a professor at 
Heidelberg, published a systematic treatise on the 
ancient countries, their organization, wealth, armies and 
fighting strength, commerce, church relations, laws, etc.^ 
In 1562, Francesco Sansovino,^ and, in 1589, Giovanni 
Botero, both Italians, published works of a similar 
nature, and, in 1614, Pierre d'Avity, Seigneur de 
Montmarin, followed with a more accurate and com- 
plete treatise in four volumes dealing, likewise, largely 
with comparative statistics. These works were pat- 
terns for numerous others which followed at later dates 
until, today, we have statistical dictionaries treating 
of almost every conceivable field of investigation, but 
still necessarily deficient in accuracy because of the 

^ Meitzen, A., Statistics y p. 20. 

* John, v., Geschichte der Staiistiky p. 38. 
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lack of uniform inquiries ip the various parts of the 
worid. 

Sec. 6. Vital and Social Statistics. 

We have outlined briefly the growth of statistics 
gathered by the various governments in order to 
measure their strength or assist in their administration 
but, in the early part of the seventeenth century, cer- 
tain new uses were suggested for some of the statistical 
data which had been collected. During the early period 
of the Reformation, the Protestant churches, largely 
in an effort to check illegitimacy, required the registra- 
tion in the church records of all births, deaths, and 
marriages. In many German and English cities these 
decrees were carried out with a fair degree of complete- 
ness. In 1612, Professor George Obrecht,^ of Strasburg 
University, proposed that the government ke<5p a com- 
plete record of all vital and criminal statistics, worked 
out a plan for carrying out his ideas, and illustrated, 
with remarkable insight, the uses to which such statistics 
could be put in devising methods for reforming the 
morals of the people, and also for providing a system of 
life insurance and pensions. 

In 1661, Capt. John Graunt, of London, made the 
first recorded analytical study in the field of vital 
statistics.* He came to the conclusions that the birth 
and death rates were quite constant; that the births 

^ Meitzen, A., Statistics j p. 25. 

* John, v., Geschichte der Statistiky p. 225 f. 
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were distributed among the sexes in the ratio of 14 boys 
to 13 girls; that the deaths among a given hundred of 
persons bom could be calculated for each succeeding 
year; and, therefore, that, from an accurate birth 
record, the total population of the country could be 
computed. 

In 1691, Caspar Neumann,* prebendary of Breslau, 
collected, from the parish registers of that city, records 
of 5,869 deaths and, from these figures, proceeded to 
demonstrate that no such fateful significance as had 
usually been supposed could be attached to the ages 
seven or nine. His notes and conclusions fell into 
possession of the Royal Society of England, and thus 
came to the attention of the noted astronomer and 
scientist, Edmund Halley. He utilized Neumann's 
figures in the computation of the first recorded com- 
plete life table and derived therefrom the expectation 
of life at each age and a scientific system of life insur- 
ance though he failed to take into account, in his 
calculations, the increase in population. 

Life insurance was, at this time, by no means un- 
known, having originated in wagers on the death of 
the captain of a vessel, these wagers being used to 
protect the ship owners in case of the loss of the vessel. 
A similar system is still in vogue as regards the life 
of some prominent person whose death might seri- 
ously inconvenience business. This system, however, 

* John, v., Geschichte der StcUistik^ p. 208 f. 
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was entirely unsuited to the person who wished to 
insure his own Ufe for the benefit of his family and 
the charges had never been, in any degree, scien- 
tifically exact. Hence, Halley's tables formed the 
foundation for life insurance in the modem sense. 
In 1698, the first life insurance institution was founded 
in London and, a year later, the "Society of Assur- 
ancy for Widows and Orphans" came into existence.^ 
Statistics was further allied to the province of 
mathematics by Jacques Bemouilli, a professor of 
Basel, who died in 1705, leaving behind a work in 
which he mathematically eludicated the theory of 
probabilities, a theory which has played no small part 
in the development of modern statistical science. 
Another advance was made by Johann Peter Siissmilch* 
who, in 1741, published a treatise in which he attempted 
to demonstrate, statistically, the doctrine of the 
"Natural Order."* He showed the approximate 
equality in numbers of the sexes at the time of mar- 
riage and construed this as a divine command in favor 
of monogamy. He, further, worked out the age 
constituency of the population and the constancy of 
the ratio between births and deaths. The fact that 
the death rate was larger in the city than in the country 
he interpreted to mean that in the cities luxury and 
vice flourished, hence bringing down the wrath of God. 

^ Meitzen, A., Statistics^ p. 33. 
' Meitzen, A., Statistics^ p. 35. 
' John, v., Geschichte der Statistikj p. 269 f. 
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Sussmilch's statistical studies were followed up 
during the early part of the nineteenth century by 
those of the renowned scientists Laplace and Fourier. 
A little later, Lambert Adplphe Jacques Quetelet, a 
prominent Belgian astronomer and mathematician, 
made extensive statistical studies in the realms of 
,^ astronomy and meteorology. His investigations con- 
cerning the weather led him to a study of the periodical 
phenomenao^j^figetation and it was but a step further 
to include the animaj^kingdom and then mankind in 
the scope of his research. The social and moral as 
well as the physical characteristics of men were 
embraced in the field of inquiry. He made the sur- 
prising discovery that similar results were obtained 
from each and every variety of phenomena observed. 
In each case, a certain mean or norm was found to 
exist^ about which the number of instances or occur- 
rences was great,' and as the distance from the mean 
increased the number of items fell off with mathe- 
matical regularity. In fact, he found that if the 
numbers of occurrences were plotted as ordinates 
that the result was a regular binomial curve identical 
with that given by the mathematical law of chance 
or probability. This seemed to indicate that man's 
actions are governed wholly by this same law for he 
showed that all kinds of human acts occurred with 

^ Quetelet described in detail the characteristics of the mean 
or average man representing the normal type of the race. 
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marked regularity. Crimes, suicides, accidents, all 
showed comparatively constant figures. This, he 
believed, proved man to be largely the product of 
his environment, society to be responsible for the 
individual, yet he expressly denied any restraining 
force on the individual and rejected fatalism as an 
explanation.^ 

Many of Quetelet's followers, however, continued 
his reasoning to what they believed to be a logical 
conclusion. This is exempUfied by Sir F. W. Herschel 
who, in 1850, declared that man was wholly the creature 
of environment and that free will, if existent, was 
practically non-perceptible. H. Thomas Buckle, the 
historian, voiced his approval of the same idea. 

The Italian school, on the other hand, were not 
at all ready to accept this doctrine without question 
and they were supported by many German statis- 
ticians. In 1871, Gustav Schmoller^ illuminated the 
question by asserting that the regularity of human- 
action was only due to regularly acting causes which, 
in general, tended to produce constant results. Free 
will was shown by the fact that results were not entirely 
regular but, at times, varied decidedly from the usual 
order, even though the causes remained constant. 
This view has since gained wide acceptance. 

^ Meitzen, A., Statistics, p. 75, also John, V., Geschicht£ der Sta- 
tistik, p. 332 f . 
\^ * Meitzen, A., StatisticSf p. 88. 
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Sec. 7. Statistics as an Aid to Economics. 

From the earliest times, statistics were considered 
necessary as an aid to administration and, in the days 
of the Mercantilists and Cameralists, many govern- 
mental policies were based on statistical information. 
Late in the seventeenth century, Gregory King had 
attempted to show statistically a fixed relationship 
between the supply and price of commodities. Most 
of the economic writers of the eighteenth century 
made more or less use of numerical data to establish 
their theories, but it remained for the Historical School 
of economists to emphasize the importance of statis- 
tics in the economic field. Since they assumed that 
economic laws and doctrines were not to be reasoned 
out abstractly but proven historically and concretely, 
statistics became, to them, a prime necessity. Bnmo 
Hildebrand, who had much practical experience as a 
governmental statistician, was a leading expositor 
of this idea and the correct methods of applying sta- 
tistics were greatly developed by his contemporary, 
Karl Knies.^ 

Sec. 8. Statistical Method. 

In analyzing statistical data, it was, of course, 
discovered, even in the earliest experiments, that 
some definite method must be followed in order to 
render the results intelligible. As data became more 
abundant and many new fields were opened up to 

^ Meitzen, A., Statistics j p. 77. 
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investigation, this subject of method became more 
and more complex. Refined scientific inquiries could 
not be conducted by the crude and cumbersome 
machinery suited only to simpler problems. As early 
as 1711, Anchersen,^ a Dane, devised statistical tables 
for the comparisons of European states^ and, in 1782, 
Crome, of Giessen, went further and utilized geometric 
figures for like purposes. F. J. Mone, in his "Theory 
of Statistics," 1824, emphasized the necessity of carefully 
worked out methods for the solution of all statistical 
problems. Ernst Engel, in his "Methods of Enumer- 
ating Population," 1861,' developed this idea more 
completely than ever before and he was ably seconded 
by Rumelin, Knies, Adolf Wagner, and M. Block. 
Within the last three or four decades, the development 
of the pure theory of statistics has had a remarkable 
jynwt.bt^ Such men as August Meitzen, Francis Edge- 
worth, Francis Galton, E. L. Thomdike, Karl Pearson, 
G. Udny Yule and C. B. Davenport in the field of 
biological statistics, and Jacques Bertillon, Arthur L. 
Bowley, R. H. Hooker, Thos. S. Adams, and Warren 
Persons in the field of economics have aided in advanc- 
ing the theory far beyond its former boimds. It is to 
the simpler outlines of this branch of statistics that 
this volume is devoted. 

* John, v., Geschichte der Statistik, p. 88. 

* Meitzen, A., Statistics, p. 41. 

* Meitzen, A., Statistics, p. 96. 
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Sec. 9. Instruction in Statistics. 

The early lecturers and writers who included among 
their teachings the subject which we now call statistics 
only introduced numerical quantities as a detail in the 
instruction in general political, geographic, and eco- 
nomic information and it was to this entire field of 
thought that the term "statistics" was first applied. 
The first recorded lectures in this line were given by 
Hermann Conring at the University of Helmstedt in 
1660.* Conring was a noted physician and also a pro- 
fessor of natural law. His statistical lectures dealt, how- 
ever, with data concerning the land and its products and 
the state and its resources. The Cameralists of the early 
eighteenth century followed his example and made "sta- 
tistics" a part of their teachings. The first, however, 
to organize this mass of knowledge into a logical whole 
' was Gottfried Achenwall (often called the "Father of 
Statistics") a professor of the University of Marburg. 
It was he who first applied to this line of study the 
appellation "statistics," deriving the term from the 
Italian word "statista" meaning statesman. Achen- 
walUs main idea was the comparison of one state with 
another in order to find a correct guide for political 
action. He dealt in his lectures with Spain, Portugal, 
France, Great Britain, the Netherlands, Russia, Den- 
mark, and Sweden.^ He began his lectin-es in 1746. 

^Meitzen, A., Statistics, p. 22, and John, V., Geschichte der 
Statistik, p. 52. 

* Meitzen, A., Statistics, p. 24, also John, V., Geschichte der 
Statistik, p 74 f. 
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We have noted the fact that, at this time, the instruc- 
tion given under the title of "statistics" included all 
that the schools then taught of political economy and 
geography Adam Smith first really segregated polit- 
Ical economy as a separate science when he published 
''The Wealth of Nations" and this work was soon fol- 
lowed by the teachings and writings of Stewart, ^al- 
thus, Ricardo, Say, Sartorious, Jacob, and Kraus. 

In the early part of the nineteenth century, C. Ritter 
published his "Science of the Earth in Relation to the 
Nature and History of Men." This voluminous work 
tended to estabUsh geography, also, as an independent 
science. Life insurance, which, as we have seen, played 
an important role in the origin of statistical studies 
likewise tended, as it was perfected, to become a distinct 
branch of study. 

Only in comparatively recent years have courses been 
offered in the leading universities of Europe and 
America on the science of statistics as the term is now 
limited and defined, and, even yet, the study is almost 
invariably taken up as a subordinate branch of political 
economy or biology. 

Sec. 10. Different Branches of Statistics. 

The modem domain of statistics can be generally 
divided into two main provinces, statistical method and 
applied statistics. 

Statistical method may properly be considered a 
branch of mathematics, inasmuch as it attempts to 
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formulate definite rules of procedure applicable in 
handling groups of data of many different varietfes. 
Many of these rules apply equally well to economic 
or biological data while, in other cases, special rules may 
be developed which are best adapted to one particular 
field. The methodologist is not concerned in any but 
the most indirect way with the specific investigations for 
which the general laws, rules, and methods which he 
formulates are to be used. 

Applied statistics, as the name signifies, consists of 
the apphcation of the rules and formulae laid down by 
the methodologist to the concrete facts as they exist, 
the relationship being the same as in the case of pure 
and applied science. While the methodologist is likely 
to be primarily a mathematician, the specialist in 
applied statistics may be a census expert, a state oflScial, 
a sociologist or philanthropist, a biologist, an economist, 
an insurance actuary, or an investigator in almost any 
other branch of human knowledge. 

The province of appUed statistics may be legitimately 
subdivided again into two general fields — descriptive 
and scientific. The descriptive field deals with records, 
either of things in their existing state or from the 
historical standpoint. The U. S. census with its mani- 
fold tables comparing the people and resources of the 
different sections of the United States both for the 
given time and with preceding decades is a most 
excellent example of well-developed descriptive statis- 
3 
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tics. In this branch of statistics, nearly every citizen 
is interested, since everyone wishes to know whether 
his state or nation is growing in population or wealth, 
whether foreigners are more or less numerous than in 
the neighboring commonwealths, etc. 

For the scientist, however, statistics have still a 
different interest. He wishes to use the data which 
record past events in order to establish definite 
physical or psychological laws. The biologist is anxious 
to verify his hypothesis concerning heredity, the meteor- 
ologist wishes to trace a connection between sun-spots 
and temperature, the economist desires to verify the 
quantity theory of money, and the statesman endeavors 
to demonstrate the salutary effects of a tariff law. 
Thus, scientific statistics makes use both of the rules 
laid down in statistical method and the data collected 
for descriptive purposes. Scientific statistics, then, is 
the ultimate goal toward which a large part of the work 
of all modem statisticians ultimately tends. 

Sec. 11. Summary. 

We have thus traced very briefly the development of 
the science of statistics from its primitive form to its 
present complex status. With this preliminary setting, 
the student will be perhaps prepared better to take up 
the study of statistical method as set forth in later pages. 
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CHAPTER II. 
THE SCIENCE DEFINED. 

Sec. 12. Definition of Statistics. 

We have seen, in the preceding chapter, that many 
different forms of knowledge have, in the past, been 
termed statistics and that, today, the science has shifted 
far from its old meaning, the study of the state. Our 
next task is to formulate a definition suited to the 
science in its present-day aspect, a definition which is 
inclusive enough to take in all that is still classed imder 
this nomenclature and exclusive enough to keep out all 
ideas extraneous to that title. Webster defines the term 
"statistics" thus: "Classified facts respecting the con- 
dition of the people in a state . . . especially those 
facts which can be stated in numbers or in tables of 
numbers or in any tabular or classified arrangement." 
This definition accords with the etymology of the word, 
which it will be remembered is derived from statist 
and more remotely from state. Until comparatively 
recent times, this phase of statistics was almost the 
only one worthy of mention, but nowadays, the term 
is applied to the study of data obtained in the fields of 
biology, astronomy, etc., so that the definition must, 
necessarily, be expanded to cover the new uses. 

A similar idea, but somewhat more comprehensive, 

20 
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is expressed by the following definition given by Bow- 
ley:^ "Statistics is the science of the measurement of 
the social organism, regarded as a whole, in all its 
manifestations." This statement, however, as its au- 
thor says, limits the science to only one field — that of 
man and his activities. Modem statistics takes into 
consideration, however, biological, astronomical, and 
physical as well as social phenomena, hence, the defini- 
tion is obviously too narrow. 

Statistics has also been denominated "the science of 
counting."* This obviates the above-mentioned error 
of confining the definition to only one field. On the 
other hand, serious defects of another kind at once 
appear in this definition. The major part of statistical 
work includes not only mere counting but also the 
further process of making estimates. In collecting its 
statistics of wheat grown in the United States, the 
Department of Agriculture does not attempt to get an 
actual record of each bushel produced but simply ob- 
tains estimates of the present year's crop as compared 
with that of previous seasons. In this way, crop 
reports of a fair degree of accuracy can be obtained with 
almost no actual counting. In fact, in dealing with 
large numbers, an accurate count is almost always a 
physical impossibility. It is self-evident, for example, 
that a large number of people are not counted when a 

* Bowley, A. L., Elements of Statistics f p. 7. 

* Bowley, A L., Elements of Statistics^ p. 3. 
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census is taken and the names of many others are 
entered in the lists of two or more different enumerators. 

Another defect of this definition is that it would seem 
to apply only to the collection of data and not to the 
analysis of the material collected while, as a matter of 
fact, both parts are essential to any complex statistical 
study. Hence, this definition must also be rejected 
as inadequate. 

One of the prime objects of statistics is to give us a 
bird's-eye view of a large mass of facts, to simplify this 
extensive and complex array of isolated instances and 
reduce it to a form which will be comprehensible to 
the ordinary mind. To att^n this end, averages are 
very often used, hence, Bowley says: "Statistics may 
rightly be called the science of averages."^ But 
modem statistics goes further than to present mere 
averages. By graphic processes, we see variation, that 
is, the fluctuations with regard to a certain standard or 
norm, portrayed as, for example, when we chart the 
temperature changes of a season or a cycle. By picto- 
grams are shown relative totals, familiar illustrations 
being the bars or squares used to show the relative 
population, wealth, products, or expenditures of dif- 
ferent nations. By correlation tables and coefficients, 
relationships are indicated. Therefore, this definition, 
likewise, seems far too restricted. 

A more comprehensive definition, and the one which 

* Bowley, A. L., Elements of Statistics^ p. 7. 
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we shall adopt for the purposes of this work, is the fol- 
lowing: The science of statistics is the method of 
judg^g collective natural or social phenomena from the 
results obtained by the analysis of an enumeration or 
collection of estimates. This is certainly more inclusive 
than any of the preceding definitions and, while it is 
possible that statistical problems might be imagined 
which would not fall within its limits, it is sufficiently 
broad for practical purposes. 
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CHAPTER III. 

USES, CHARACTERISTICS AND SOURCES OF 

STATISTICS. 

Sec. 13. Necessity of Statistical Science. 

The human mind is so constituted that it cannot 
image and comprehend a large number of distinct 
impressions at any one time. As a result, it is im- 
possible to compare intelligently two complex groups 
of things without simplification of the groups in some 
manner. It would be a man of exceptional mnemonic 
power who could, after listening to the reading of two 
lists of one hundred items each stating the names and 
wealth of the respective inhabitants of two villages, give 
any intelligent opinion as to the comparative riches of 
the two communities. If this is true for such small 
groups as this, it evidently would be utterly impossible 
to make comparisons of the wealth of great nations 
without some manner of reducing the mass of separate 
facts to a simple whole. The same would, of course, 
be true in the case of any other phenomena involving 
large numbers. What could one understand of the 
amount of lumber contained in a forest from a descrip- 
tion of the separate trees? How could one compare the 
climates of different localities by a study of their^^ily 
weather records? It is for the purpose of simplifying 
these unwieldy masses of facts that statistical science 

24 
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is useful. It reduces them to numerical totals or aver- 
ages which may be abstractly handled like any other 
mere nmnbers. It draws pictures and diagrams to 
illustrate general tendencies and, thus, in many ways 
adapts these groups of ideas to the capacity of our 
intellects. 

Sec. 14. Uses of Statistics. 

The facts having been once simplified they are now in 
a shape where they may be used for purposes of com- 
parison and this is one of the principal aims which the 
science of statistics has in view. We are interested to 
know the population of the United States not, primarily, 
for the value of that fact in itself, but, principally, in 
order that we may compare the population of today with 
that of past decades and thus picture in our minds the 
nation's growth or that we may compare the nmnbers of 
our people with the numbers of other lands or that we 
may compare the growth in oiu" population with the 
growth of our food supply, our manufacturing or mining 
industries, our increase in wealth, or any one of many 
similar things. Thus, it is relative rather than absolute 
size which appeals to our imaginations. 

These comparisons, however, are seldom made simply 
with the idea of satisfying our idle curiosity. They are 
necessary in order to settle the most weighty questions 
of government and economics. How could congres- 
sional constituencies be justly apportioned without a 



26 ELEMENTS OF STATISTICAL METHOD. 

population census? Is tuberculosis increasing or de- 
creasing? The answer to this question must be shown 
statistically and has weighty import from the points of 
view of public finance and general public policy in 
fighting the disease just as well as from the standpoint 
of the health of the inhabitants. Should the railroads 
be allowed to increase their freight rates? Before this 
can be determined, we must have reliable statistics of 
earnings and expenditures. In fact, few important 
actions can be properly taken by a modem government, 
or even a modem corporation, without a statistical 
study of conditions in the field in question. 

The rapid growth of the science of accounting and the 
general demand for uniform systems of accounts for all 
municipalities has greatly emphasized the need of cor- 
rect statistical methods in this line. 

Numerous commissions have recently sprung into 
being whose duties vary from the government of cities 
to the regulation or investigation of almost every phase 

of private or governmental activity. The reconmienda- 
tions or decrees of these commissions must, in nearly 
all cases, rest largely upon statistical information and. 
the merit of/ the results obtained therefore depends 
primarily on the correctness of the statistical method 
employed and the accuracy with which it is carried out. 

Every insurance company must base its rates upon* 
computations derived through the study of large masses 
of data. Since new forms of insurance are constantly 
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being evolved and since conditions of life are ever 
changing, new statistics must continually be collected 
and new calculations as continuously be made. An 
example of this is the insurance of workingmen against 
unemployment, a question which is now first being 
seriously considered in many countries. As yet, the 
statistics are far too incomplete to make possible a 
comprehensive scientific system of insurance in this 
line even if proper methods and safeguards can be 
worked out. 

But practical statesmen and men of affairs are not the 
only ones who find in statistics a most valuable ally. 
The theoretical economist or the scientific investigator 
in the field of natural phenomena must likewise con- 
stantly call on statistics for proof or verification of his 
j hypotheses or theories. The biologist thereby verifies 
^T the laws of variation and heredity. The economist 
seeks to establish laws of population, of wages, of prices, 
or to show the connection between different groups of 
phenomena as, for example, financial crises and unem- 
ployment. The sociologist would demonstrate the rela- 
tionship of sales of alcoholic hquor to crime, poverty, 
suicide, and similar phenomena where some connection 
exists or is suspected. 

Bowley says: "The proper function of statistics, 

f indeed, is to enlarge individual experience."^ Without 

a statistical study, most of our ideas are likely to be 

^ Bowley, A. L., Elements of StcUistics, p. 8. 
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decidedly vague and indefinite. A reduction to figures 
gives clear cut form to this hazy conception, enables us 
to set objects in their proper perspectives and relation- 
ship, and, hence, gradually, to work out the laws 
governing their movements and changes. 

Sec. 15. Law of Statistical Regularity. 

One of the most valuable characteristics of modem 
scientific statistics is that it succeeds in giving us a 
sufficiently accurate picture of a group of objects with- 
out going through the laborious and expensive process 
of a complete enumeration of all the items in the group. 
Thus, it is by no means necessary in ascertaining the 
average wage of American workingmen to obtain data 
regarding each man at work. If certain typical in- 
stances can be obtained and property averaged, the 
difference from the true average wage of all the working 
men is likely to be such a small quantity as to be, for 
all practical purposes, negligible. Similariy, the anthro- 
pologist can discover the physical characteristics of a 
tribe or race by taking careful measurements of only 
a small minority of the whole. This is due to the law 
of nature formulated in the mathematical theory of 
probabilities which states that a moderately large 
number of items chosen at random from among a very 
large group are almost sure, on the average, to have 
the characteristics of the larger group. Thus, if two 
persons, blindfolded, were to pick here and there three 
hundred walnuts each from a bin containing a million 



USES AND CHARACTERISTICS OF STATISTICS. 29 

nuts, the average weight of the nuts picked out by each 
person would be ahnost identical even though the nuts 
varied considerably in size. Furthermore, if one were 
to obtain the average weight of the whole million it 
would not dififer, essentially, from the average weight 
of either of the smaller groups. 

This principle may be easily verified by taking a 
small number of dice and throwing them forty or fifty 
times. If four dice are taken, the total number of 
spots on both sides is twenty-eight. On the average, 
half of these, or fourteen, should turn up each time. 
In fifty throws, the total number of spots turned up 
should be 700. Experiment will show that the ap- 
proach to this number will be surprisingly close. It is 
upon this principle that gamblers are enabled to run 
continuously and profitably with only small odds in 
their favor. It is this same principle which gives rise 
to the regularity in the number of crimes and the num- 
ber of suicides, facts which, as we have seen, once 
greatly troubled the advocates of the doctrine of Free 
Will. It is this principle which makes possible insur- 
ance against [death or other calamities. This princi- 
ple is frequently denominated the law of statistical 
regularity. 

It must not, however, be inferred from the above 
that any number of samples, no matter how large, will 
give exactly the same results as would be obtained by 
the use of the entire mass of data. The probability of 
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error diminishes constantly as the number of items 
used as samples increases. If, then, only a few sample 
items are used the chance error is likely to be so large 
as to seriously vitiate the results but, as the number of 
samples chosen grows large, the error diminishes until 
it eventually becomes negligible. 

Sec. 16. Inertia of Large Numbers. 

The law of inertia of large numbers is a corollary 
of the law of statistical regularity. It arises from the 
fact that, in most classes of phenomena, when one part 
of a large group is varying in one direction, the prob- 
abilities are that another equal part of the same group 
is varying Ia. the opposite direction; hence, the total 
change will be slight. Thus, for example, while the 
amount of wheat produced in any one locality varies 
immensely from year to year, the wheat production of 
the world, as a whole, remains relatively stable for 
decades. The losses from fire in a single city may be 
fifty times as large in a given year as in the preceding one 
but the annual losses throughout the entire country 
will remain almost constant. Hence, a fire insurance 
company can, years in advance, calculate its losses 
with a fair degree of accuracy. Statistical science, 
then, is largely based on the theory of probabilities 
and its corollaries. 

This property of inertia by no means precludes the 
possibility of change with the passage of time. It only 
means that when the numbers involved are of great 
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magnitude the change is likely to be more regular than 
in those cases in which small quantities are involved. 
Thus, while the fire losses in the United States would be 
relatively constant from* year to year they might never- 
theless be steadily diminishing owing to a tendency to 
erect stone or concrete buildings instead of frame 
structures. Similarly, the wheat production of the 
world gradually increases as new lands are brought into 
cultivation. 

This property of inertia is less evident when, for any 
reason, there is greater probability of variation in one 
direction than in the other. If, for example, in a given 
state practically all the cities had borrowed up to the 
legal debt-limit, a curtailment of the debt of one city 
would rarely be offset by an increase in the debt of 
some other, hence the stability of the total of the city 
indebtedness within the state would be affected to a 
larger extent relatively by such decreases in the debts of 
individual cities than if the maximum debt-limit were 
non-existent. , 

Sec. 17. Distrust of Statistics. 

It is said that non-scientific people may be divided 
into two classes as regards their attitude toward new 
inventions or discoveries. One class accepts without 
question the wildest stories of incredibly marvelous 
discoveries and wonders why no one stumbled upon 
them before. The other class, usually possessing a 
little more education, are skeptical of all scientific truth 
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and label it all alike as *' guess-work." Similar attitudes 
of mind, in regard to statistics, are noticeable among 
those unfamiliar with that science. The attitude of 
the first class is well expressed by the old proverb 
''Figures won^t lie," while the other class, a little more 
erudite, are prone to characterize all statistics as 
tissues of falsehood. Either theory can be readily 
proven by selecting proper examples. 

One of the shortcomings of statistics is that they do 
not always bear on their face the label of their quality. 
The crudest table, founded on the most unreliable basis, 
appears, to the casual observer, equally valuable with 
a table compiled after months of labor by a corps of 
skilful statisticians. To judge of the value of a statis- 
tical presentation, it is, then, usually essential to know 
something of the author and his reliability and skill as 
a statistician. Yet, to a careful observer, the internal 
evidence in the table itself may be a valuable clue as to 
its merit. An amusing example of failure to observe 
intelligently is found in the book of a recent writer on 
socialism, the main thesis of which is based on an 
erroneous table taken from a government report, the 
errors in the table being so glaring as to be at once evi- 
dent to anyone in the least familiar with statistical data. 

It is true that one can prove anything by statistics 
but he can only do so by unscientific handling of his 
data or deliberate manipulation of the figures with the 
purpose of showing the desired result. It is also a fact 
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that some sets of figures are hard to analyze and their 
meaning is often doubtful and debatable but, in a large 
percentage of'cases, a clear and indisputable result may 
be arrived at if the analysis is conducted in a scientific 
and unbiased manner. The science of statistics, then, 
is a most useful servant, but only of great value to those 
who understand its proper use. 

Sec. 18. Progressive Accuracy in Statistics. 

As we have seen, the value of statistics depends 
primarily on the accuracy of the figures. Accurate 
figures are often very hard indeed to obtain. Is there 
any excuse, then, for the preparation of any statistical 
table or statement based on figures of doubtful accuracy 
or reliability? As a matter of fact, such tables and state- 
ments have marked scientific worth though scientific fair- 
ness should always lead the author to accompany the 
table with a statement concerning his sources of informa- 
tion, and the likelihood of accuracy or error therein, in 
order that his reader may not be misled. The value of 
such investigations is chiefly in the fact that they may 
serve as foundations for future work of a more accurate 
nature. Every preliminary investigation shows up the 
difficulties to be overcome, the weakness or strength 
of methods used and some general ideas as to the prob- 
able result. Most detailed investigations are difficult, 
if not impossible, without some such preliminary work. 
The measurements of the velocity of light only attained 

their present accuracy through a long series of approxi- 
4 
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inations. The perfection of modem life tables was only 
attained by a succession of improvements on the crude 
t^les drawn up by Casper Neumann from the church 
records. Therefore, inaccurate investigation is only 
to be condemned when its results are stated, not as a 
preliminary or tentative basis for further work, but as a 
final and definite conclusion. 

Sec. 19. Limitations of Statistics. 

Statistics, while an extremely useful tool to the 
investigator in almost any line of scientific inquiry, has 
its limitations and shortcomings which cannot be over- 
come. Statistics largely deals with averages and these 
averages may be made up of individual items radically 
different from each other. In the average, these ir- 
regularities are all swallowed up. Methods of analysis 
have been devised, as we shall see later, which partially 
obviate this defect but no system which makes a large 
and complex group intelligible to the mind at a glance 
can avoid effacing most of the minor irregularities. 
We usually assume, and in general correctly, that these 
items are of no importance, but this assumption is not 
always in accordance with the facts. It may be true 
that the match industry includes but an insignificant 
fraction of our working population and it may also be 
true that but a small fraction of these workers suffer 
from phosphorus poisoning but, for those affiicted, the 
fact that their numbers do not affect appreciably the 
general average does not lessen their torture, does not 
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appear to them to be and is not a suiEcient reason for 
foregoing legislation to remedy the evil. Statistics, 
from the very nature of the subject, cannot and never 
will be able to take into account individual cases. 
When these are important, other means must be used 
for their study. 

Sec. 20. Sources of Statistical Information. 

There are several ways in which the statistician may 
proceed to collect the necessary data for his work. The 
first method is that of individually collecting data for 
himself. In this case, of course, it is usually only pos- 
sible to take samples of the mass of items to be analyzed 
but the investigator has the advantage of knowing 
exactly the conditions of the investigation and the 
accuracy of sampling is entirely within his own hands. 
This method is largely used in the natural sciences and, 
to some extent, in sociological and economic investiga- 
tions but is usually impossible in the latter lines if the 
field to be covered is extensive. 

A modification of this method is for the investigator 
to employ a number of enumerators or correspondents 
and compile the results of their counts or estimates. 
This has actually been done by the large speculators on 
the Chicago Board of Trade in order to get advance 
information as to crop conditions of the world. Evi- 
dently, this procedure entails a heavy expense and is 
beyond the means of most scientific inquirers. 

A second mode of obtaining information is to turn to 
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the enumerations and estimates made by other private 
statisticians, compare their work, and compile the re- 
sults. This is difficult and usually impracticable be- 
cause of lack of information as to the sources of their 
material, and the methods and accuracy of their in- 
quiries and also because the different investigators did. 
not cover similar fields or have similar ends in view. 
As a result of these difficulties, most of the niunerical 
research work in the fields of economics or sociology 
must proceed by aid of government reports. It is only 
the government which usually has both the means and 
the inclination to collect unbiased statistics concerning 
its subjects and their activities throughout the country 
as a whole. During the last few decades, the variety of 
data gathered by the Census and other Statistical 
Bureaus of the different nations has been so extensive 
and varied as to afford a mine of valuable material 
for the patient investigator in the field of the social 
sciences. An additional advantage which the govern- 
ment has over individuals in carrying on its inquiries 
is that it may use compulsion in obtaining information. 
Naturally, this can in no way insure that the answers 
to questions will be truthful but it helps in overcoming 
the inertia and negligence of the informants, two of the 
most serious obstacles in private investigations. 

Sec. 21. Phases of Statistics. 

As we have already noticed in studying the histori- 
cal development of the subject, there are several 
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different phases of statistics each of which is or has 
been emphasized by certain schools at certain times. 
Historically, the development was as follows: 
I. Empirical Statistics. 

(A) Used principally as aid to administration. 
II. Comparative Statistics. 

{A) Used as a basis of economic doctrine. 
{B) State poUcies based thereon. 
III. Analysis of Statistics by Scientific Methods. 

(A) Statistics now better adapted to verify 

economic, social, and scientific hypoth- 
eses and theories. 

(B) Value as a guide to governmental action 

greatly enhanced 

At the present time, two distinct branches of sta- 
tistics may be discerned. 

I. Statistical Method. 
II. Statistical Information. 

A knowledge of the first of these is essential for 
the statistician in order that he may correctly obtain 
the second. The latter is the branch which is of 
interest to the general public. To use a simile from 
the field of engineering, the public cares nothing for 
the course of mathematics which the engineer must 
pursue in order to correctly construct a great bridge. 
It is interested only in the results. To the engineer, 
however, the mathematics is of prime importance. 
To attempt to handle statistics properly without a 
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knowledge of statistical method is only a little less 
absurd, though vastly more conmion, than to attempt 
to build great steel bridges without a knowledge of 
trigonometry. This volume is primarily devoted to 
the study of those elementary methods, a knowledge 
of which is essential to all desiring to engage in statis- 
tical work at first hand, especially in the realm of 
the social sciences. 
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PART II. 
THE GATHERING OF MATERIAL, 



CHAPTER IV. 

THE PROBLEM TO BE SOLVED. 

Sec. 22. Defining the Problem. 

The first thing upon which the statistical investi- 
gator, when beginning his work, must decide is the 
exact nature of the problem which he desires to solve. 
Even a slight change in its scope or form may require 
an entirely or partially different method of procedure. 
If, for illustration, a person wishes to begin a study 
of comparative wages in order to demonstrate some 
general theory or proposition, he must first decide as 
to whether the requirements of his problem demand a 
knowledge of money wages or real wages. Next, 
he must be sure as to whether he needs to know the 
wages paid for a definite amount of eflfort, for making 
a certain product, or for working a certain length of 
time, or whether the inquiry relates to the income of 
the working man himself per year or to the total 
income of the man and his family for the same period. 
Each of these problems is a distinct one and would 
require entirely diflferent methods of determination. 
The first essential then is to make the problem definite 
and clear-cut. 

39 
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Sec. 23. Selection of Factors of Problem. 

In almost every statistical problem, some comparison 
is involved and this comparison can usually be ex- 
pressed as a percentage or a ratio. If we wish to 
compare death rates in various cities, we usually 
speak of it as a certain number per thousand of inhab- 
itants; in referring to growth of cities, we use percent- 
ages while, in giving a comparative record of suicides, 
we would probably state the results as a certain num- 
ber per hundred thousand. In either of these cases, a 
numerator and denominator are required and the 
quotient may be referred to as a coefficient. It is 
of the utmost importance that the terms of this frac- 
tion be selected with the greatest care for, if these 
are erroneously chosen, the final comparison will be 
vitiated if not rendered worthless. It is this variety 
of error which is most subtle and most likely to deceive 
the mvestigator himself and the results are so specious 
that the public naturally accept them at their face 
value and later, when their falsity is demonstrated, 
another impetus is given to the feeling of distrust 
toward statistics in general. 

An example of such a fallacy, due to the use of er- 
roneous factors, was furnished by a newspaper in a 
discussion of the American navy during the Spanish- 
American war. It was stated that the death-rate in 
the navy during the war period was only nine per 
thousand while in the city of New York for the same 
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period the death-rate was sixteen per thousand. The 
conclusion was drawn that it was safer to be a sailor 
in our navy in war time than to live in New York City. 
A little reflection, however, will convince one that such 
a conclusion is not warranted by the figures given. 
In obtaining this ratio, the total number of deaths was 
taken as the numerator in each case and the denom- 
inators were respectively the total number of persons 
living in New York City and the total number of sailors 
in the navy. But, as a matter of fact, these numbers 
were wholly incomparable. It is a well known fact 
that the death rate is very high among young children 
and among old people. But the personnel of the navy 
is composed almost wholly of young men in the prime 
of strength and vigor. Not only this, but each must 
pass a strict examination to show that he is healthy 
and robust. Thus, the weak and diseased are elim- 
inated. Evidently, the facts would require that the 
death rate in the navy be compared with the death 
rate of a similar picked body of men in New York City 
before any legitimate conclusions could be drawn re- 
garding the comparative chances of death in the two 
places. 

Similarly, if one is desirous of comparing the number 
of murders in the Klondike with the number in Chicago 
in order to draw conclusions concerning the general 
propensity of the people of each place toward violence, 
it would be entirely improper to divide the number of 
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murders in each locality by the respective populations. 
Murders are, in the great majority of cases, conamitted 
neither by women, children nor the old and decrepit. 
Again, while the numerators have been correctly se- 
lected, the denominators should be approximately the 
respective numbers of the male population between 
the ages of sixteen and sixty in each place. 

Bertillon, the eminent French statistician, gives the 
following rule for coefficients — "Always compare effects 
to the causes producing them." ^ This idea may be 
restated thus — ^Be careful to so select the quantities 
used as numerator and denominator, in each case, 
that the quotients derived may be legitimately com- 
pared. 
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CHAPTER V. 

THE STATISTICAL UNIT. 

Sec. 24. Determining the Unit. 

The science of statistics deals with numbers and 
nmnbers pre-suppose units. At first glance, the de- 
termination of a unit seems a very simple matter indeed 
but, more often, the opposite is true. Before a definite 
conclusion can be arrived at, it is usually necessary to 
take into consideration the nature of the result desired. 

As we have seen, one of the oldest forms of inquiry 
is the counting of the people. In the United States, 
we have the decennial census whose original purpose 
was to determine the population of each of the states 
in order that representatives might be correctly appor- 
tioned among the same. To this end, the Constitution 
provides, in Amendment XIV, Sec. 2, "Representatives 
shall be apportioned among the several States according 
to their respective numbers, excluding Indians not 
taxed." The question arises as to whether or not the 
makers of the Constitution intended this to be taken 
literally. For purposes of Congressional apportion- 
ment, would a person include an Indian laboring in a 
city if his name did not appear on the tax roll or does 
it refer only to Indians on reservations? Would the 
term include the half-breed on the reservation? If so, 
does it take in the person of one sixteenth Indian blood 
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or, indeed, the "squaw man"? Will this unit — ^the 
person — include the French traveller who happens to 
be in New York on the day of the census? If it refers 
only to residents, will it count the Dakota farmer who 
is doing his week's trading across the border in Canada 
when the enumerator appears on the scene? If he is 
included, how about his son who is with the engineer 
corps in Panama for a few months? These few ques- 
tions will show the difficulty of defining such a simple 
unit as a person according to the requirements of our 
Constitution. 

The Census Bureau attempts to find the number of 
farms at each decade. How is the term to be defined? 
Is or is not a five acre market garden a farm? Does the 
term include the thousands of acres of government land 
ranged over by the cattle of a ranchman owning a single 
quarter section? If a man owns two eighty-acre tra6ts 
half-a-mile apart and works them both, do they con- 
stitute one or two farms? If his hired man lives on 
one eighty does that change the status? If he rents 
one eighty to a tenant what is the effect? If he has a 
large plantation with a dozen tenants thereon all under 
his superintendence, how many farms are there? These 
difficulties are better illustrated in the introduction to 
Volume I on Agriculture in the U. S. Census of 1900, 
pp. xiii to xvii. 

But if such units as a person or a farm are hard to 
define expUcitly, what about the difficulty when such a 
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unit as a criminal is in question? To be a criminai 
must one be guilty of felony? What about the man who 
commits murder but bribes the jury to acquit him? 
Manifestly, in this case, we arrive at obstacles that are 
all but insurmountable, yet, a definition of a criminal 
seems indispensable before one can obtain any com- 
parative statistics of crime. 

That the unit as finally decided upon shall correspond 
to the name given it is highly desirable, but, even if 
the suitableness of the name is in doubt, it is not only 
desirable but strictly essential that the unit be accu- 
rately and unmistakably defined and that the same 
unit be used in each of the periods or places between 
which it is intended to make comparisons. In order 
that the unit shall be perfectly explicit, it is necessary 
that its definition shall, before beginning the investi- 
gation, be worked out in minute detail so as to cover 
all imaginable questions which may arise concerning it. 
If the person defining the unit does not intend to con- 
duct the investigation himself but expects to employ 
enumerators, it is essential that the imit be described 
in such clear terms and that the details of the defini- 
tion be so conveniently outlined and arranged that the 
enumerator can easily find and comprehend every 
sentence of the instructions. Eniunerators are of only 
ordinary intelligence and, if any considerable number 
are employed, the least ambiguity is certain to give 
rise to confusion 
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Sec. 25. Necessary Characteristics of the Unit. 

Not only must the unit selected be defined with 
precision but it must also be of such a nature that it 
may be correctly ascertained. Suppose that one is 
interested in determining the comparative education in 
two communities. It would be absurd to select as the 
unit for the numerator the educated person, for it 
would be impossible to define a imit that would fit the 
title and then locate the persons corresponding to the 
definition. Some simpler unit must be used for the 
numerator based on some tangible measurement such 
as a college degree, a certificate of graduation from a 
high school, a certain nmnber of years' schooling, 
ability to read and write, familiarity with a certain 
brief Ust of facts, or some other specific evidence rather 
than on the inner characteristics which we think of when 
we refer to a man as educated. In brief, then, the ab- 
stract must be measured by its concrete manifestations. 
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CHAPTER VI. 

PLANNING THE COLLECTION OF DATA, 
Sec. 26. Preliminary Plans. 

Before the actual collection of material is begun, 
every phase of the question should be carefully studied 
in order that no energy should be wasted, errors reduced 
to a minimum, and the necessity for a second inquiry 
be avoided. In fact, one of the pecuUarities of statis- 
tical work is that practically everything must be antici- 
pated in advance, all possible sources of error detected 
and guarded against, and even the general results 
estimated. Problems, factors, units, questions, sched- 
ules, enumerators, tabulation, methods of work, time, 
expense, etc., are among the items that must be carefully 
gone over in minute detail. Statistical work is tedious, 
at best, and errors and misunderstanding are likely to 
occur in spite of all precautions, but each hour spent 
in carefully prearranging the work is likely to save a 
score of hours in trying to straighten out the confusion 
due to a hasty and ill-advised program. 

As has been said before, several methods of investiga- 
tion are possible which may be broadly classed under 
two general heads, primary and secondary. 

I. Secondary Investigation. 

Sec. 27. Characteristics. 

For this sort of an investigation, the preliminary 

work that can be done is slight. Nearly everything 

47 



48 ELEMENTS OF STATISTICAL METHOD. 

depends on the material collected and plans for the 
collection are therefore difficult to formulate* 

II. Primary Investigation. 

Sec. 28. General Characteristics. 

In the case of a primary investigation, circumstances 
are radically different for, in this, the director of the 
work can make such plans for collection as he believes 
best adapted to the required end. Four general plans 
are possible, personal investigation, estimates from 
correspondents, schedules to be jQlled by the informants, 
and schedules in charge of enumerators. The proper 
method is, of course, determined by the nature of the 
problem, the accuracy of results desired, and the finan- 
cial resources available. 

Sec. 29. Personal Investigation. 

This method is especially adapted to intensive 
studies. A good example of work along this Hne is 
that conducted by Le Play, in Europe. He studied 
workingmen's budgets by spending several months in 
the home of a single family of working people and 
repeating the process with a number of famiUes. By 
following this method for many years, he obtained 
statistics of great accuracy but the number of families 
which it is possible to study in this way, in a reasonable 
amount of time, is too small to constitute a fair sample 
of the whole. Le Play, even in a lifetime, could not 
study a very large number of families, but it is note- 
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worthy that large scale investigations conducted since 
that time have not overthrown the fundamental prin- 
ciples which he set forth. Arthur Young, by his travels, 
gives us another variety of personal study of a less 
intensive type. Booth's great work on the "Life and 
Labour of the People of London" is also partially a 
personal work and the same may be said of Rowntree's 
studies in York, England, in 1900, but these are on a 
less intensive basis than those of Le Play. 

This type of inquiry, while admirable because of 
additional accuracy due to personal supervision, must 
needs cover too narrow a field to be representative and 
is also liable to too large an injection of the personal 
element. The prejudices and desires of the investigator 
become too often unconsciously woven into the fabric 
of his conclusions. 

Sec. 30. Estimates from Correspondents. 

When it is desired to obtain only an approximate 
result, this method is often used because of its ease and 
inexpensiveness. As has been said, this is a favorite 
method of obtaining crop reports, estimates usually 
being stated as a percentage of increase or decrease 
from the normal or from the preceding year. While 
individual repoits are necessarily quite inaccurate, the 
errors involved tend to compensate each other and, 
when a large number of reports are returned, the net 
results are likely to be approximately correct. A modi- 

5 
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fication of this plan is that in which agents are sent 
through the country to collect the estimates. 

Sec. 31. Schedules to be filled by Informants. 

This is another extensive method and diflfers from 
the preceding only in that the questions asked are those 
concemmg which the mformant is presumed to have 
definite, accurate, knowledge. Like any method rely- 
ing on correspondents, it has the serious defect of 
depending for its success on persons whose interest in 
the work is, to say the least, not acute. A large per- 
centage of schedules are usually not returned unless 
they emanate from the state or some of its representa- 
tives endowed with, and actually exercising, compulsory 
powers. Those schedules that are returned are often 
. extremely incomplete and full of errors. If they are 
very simple, the probabilities of receiving a reasonable 
percentage of fairly correct schedules is greatly aug- 
mented. The average informant is surprisingly ignorant 
and careless in matters of this kind and questions must 
be made much more simple than if they were to be 
placed in the hands of enumerators. A schedule of 
this sort should always bear a statement of its exact 
purpose and the person or authorities responsible for 
the inquiry. Otherwise, suspicion and prejudice will 
unite with the natural inertia of the informant and 
replies will not be forthcoming. Questions for this 
variety of schedules should usually deal with present 
facts, only, for it is practically hopeless to get records 
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of the past which are accurate enough to warrant the 
trouble involved. 

The main advantage of this method is that a large 
territory may be covered at only a small fraction of the 
expense necessary to pay for sending out enumerators. 
If a reasonable number of well-filled schedules are re- 
ceived, they constitute good samples which are generally 
representative of the results as a whole and hence the 
work may be completed with tolerable accuracy. 

This plan is extensively used by private individuals 
and also for government reports. Statistics of wages, 
unemployment, local expenditures, weather reports, 
etc., are regularly obtained in this manner. Where 
there is a law requiring the filling of the schedule and a 
legal penalty attached for neglecting this duty, very 
satisfactory results are often forthcoming. Even volun- 
tary reports, as of the weather, frequently furnish quite 
regular and reUable data. This is much more often 
true in regard to reports sent in by picked observers 
at regular intervals than of those sent in only once. 
Most of the rules for schedules and questions given 
under the next title are also applicable to the foregoing. 
We shall discuss these under the head 

Sec. 32. Schedules in Charge of Enumerators. 

iThis is the plan followed in the leading governmental 
investigations and is usually too expensive to be under- 
taken by private initiative. It is unquestionably the 
best plan for most kinds of extensive inquiries. 
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In this kind of an inquiry, the schedules may be much 
more complete than in case of those sent directly to 
voluntary informants and hence the scope of the inquiry 
may be greatly enlarged. However, care should be 
taken that the schedules should be of convenient size 
, and form for the enumerators to handle, not unwieldly 
folded sheets of large size which' are hard to manipulate 
and are easily torn. The schedules should also be 
spaced and ruled in such a manner as to enable the eye 
to easily follow the line or column across the paper. 
Headings and subheadings should be placed in proper 
relationship to each other and the type used be such as 
to bring out the distinctions properly. Every heading, 
form and title should be so lucid that any person of 
ordinary intelligence can fully comprehend its signifi- 
cance. Each word and phrase should therefore be 
carefully scrutinized for possible double meanings or 
debatable interpretations. Care should also be taken 
to indicate in the headlines the exact degree of accuracy 
to which each nimierical result is to be read. These 
simple precautions will prevent much needless confusion 
and loss of time to the enimierators as well as many 
unnecessary errors. 

A sample schedule of occupation and wages is given 
opposite, this form being intended for cases in which 
the information is to be gathered from the wage earners 
and not from the employers. Of course, full instruc- 
tions for the interpretation and use of the schedule as 
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well as a sample schedule properly filled out should be 
part of each enumerator's equipment. 

Another type of schedule very popular at present is 
the individual card. In this case, the information con- 
cerning each separate individual is placed on a card by 
itself. This plan is especially desirable in case that it 
is intended to later group the different items in the 
schedule in different orders. For example, if, in the 
appended schedule, one desired first to classify the 
workers according to occupations and later according 
to wages and imemployment the grouping process would 
be greatly facilitated by the use of cards. The card sys- 
tem is almost the only feasible one where the record is to 
be continuous and constantly expanding. On the other 
hand, cards are much more bulky, require larger space 

for filing and are more inconvenient to total. For the 
purposes of the U. S. Census and other extensive 

investigations, a combination of the two methods is 
used. The data are first entered on schedules by the 
enumerators, and later punched on properly prepared 
cards on which titles are represented by nimibers, and 
then tabulated by intricate electrically-operated ma- 
chines. The method to be adopted is, of course, 
dependent on the specific characteristics of the investiga- 
tion in question. 

Sec. 33. The Choice of Questions. 

In selecting the questions to be &sked of the infor- 
mant, one must differentiate between the cases in which 
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the filling of the schedules is left to the convenience of 
the informants and those cases in which the furnishing 
of the information is i*equired by law. In the first case, 
questions must be very simple, few in number, and 
easy to answer, otherwise, one may feel sure that a 
large percentage of them will go unanswered. In the 
latter case, the niunber of questions may be considerably 
increased but their character of simpUcity must be pre- 
served in order to obtain answers whose accuracy is 
sufficient to be of value. 

When eniunerators are sent out, especially if they 
have legal powers to require answers to their questions, 
the questions may be more complex and more nimierous. 
They must never be so difficult that the enumerator 
cannot correctly interpret them by aid of his printed 
instructions. If, however, the eniunerator thoroughly 
comprehends the question he may, by several related 
inquiries, obtain the required data even where the 
meaning of the original question is not entirely clear 
to the informant. 

The investigator must always bear in mind that each 
additional question means additional expense and extra 
work in tabulation. In an extensive investigation like 
the National Census, a single inquiry entails a cost of 
many thousands of dollars. Under these circum- 
stances, the number of questions is strictly limited by 
the funds available. It becomes, then, a question of 
eliminating the least essential questions and retaining 
those deemed most indispensable. 
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In order to render it possible to tabulate the results, 
it is necessary that the questions shall be such as may 
be answered by yes or no or a simple number. If a 

question asks for the education of the informant an 
infinite munber of over-lapping answers will result such 
as "good," "considerable," "well-posted," "high- 
school," etc. Such answers can scarcely be handled 
statistically. If, on the other hand, the number of 
months attendance at school is asked for, the munerical 
reply will be susceptible of tabulation and will approxi- 
mate the result sought. 

In draughting a set of questions, one must, as far as 
possible, avoid those which are likely to arouse the 
resentment of the informants or those whose answers 
are likely to be affected by prejudice. If hostiUty is 
once aroused, it is difficult to get any further correct 
information. Questions concerning indulgence in stim- 
ulants, phj'^sical or mental infirmities, and the like, are 
good examples of those which provoke antagonism. 
The ages of women are likely to be understated in a large 
fraction of all answers given. When it is necessary to 
ask inquisitorial questions it is a valuable help to check 
the answer by some corroboratory question as, for 
example, to inquire the age in years and, at some later 

point in the inquiry, get the date of birth. A skillful 
enumerator should be able to thus unravel the truth in 
many cases. 
Care should always be taken to see that the questions 
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exactly cover the points desired in the study and are 
not subject to double interpretation. One of the main 
difficulties in using results of other investigations is that 
the questions asked therein were not usually intended 
to cover exactly a similar point and a slight difference 
in the wording may bring vastly different results. If, 
for instance, one were computing the number of train- 
miles travelled on a certain railroad system, it would 
be a matter of decided importance as to whether the 
query was so worded as to include or exclude the dis- 
tances covered by switch-engines. 

To summarize, then, the questions chosen should be^ 

1. Comparatively few in number. 

2. Require an answer of a number or yes or no. 

3. Simple enough to be readily imderstood. 

4. Such as will be answered without bias. 

5. Not imnecessarily inquisitorial. 

6. As far as possible corroboratory. 

7. Such as directly and unmistakably cover the point 
of information desired. 

Sec. 34. Defining the Field. 

Having settled upon the problem, method of inquiry, 
schedules, questions, etc., the investigator must now 
decide upon the scope which the inquiry is to have. 
Both time and space being limitless, certain definite 
confines must be fixed beyond which the study shall 
not extend. If a study of incomes is to be made, it 

^ See Bowley, A L., Elements of Statistics, pp. 18-25. 
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may deal with one city, or several cities, one county or 
several counties, one state or the whole nation. One 
may compare the incomes in the different locaUties at 
the present time or in the same locality at different 
times. 

Sec. 35. Representative Data. 

Private investigators, being usually tmable to cover 
thoroughly as large a field as desired, quite generally 
resort to the method of securing sainple data. If it is 
wished to learn something of workingmen's budgets, no 
effort is made to obtain the record for all the families 
of any one conununity but sample families are taken 
which are supposed to represent the entire field. The 
results thus obtained are likely to be quite satisfactory 
if the instances are niunerous and the sampling has been 
properly done. The study of Professor Chapinof the 
condition of the working people of New York City gjf es 
results in many respects quite in accord with the far 
more extensive investigations conducted by the Bureau 
of Labor, though the niunber of families that he studied 
was necessarily very much smaller than that included 
within the boimds of the government inquiry. There 
is, however, always danger of incorrect sampling owing 
either to accident or to conscious or unconscious manipu- 
lation on the part of the investigator in order to obtain 
the results desired. A Marxian socialist, desiring to 
prove that conditions were growing worse, would be apt 
to select too large a percentage of the poorest families; 
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an optimist would probably take too many instances 
from the more fortunate classes. Proper sampling may 
be secured simply by taking a very large number of 
instances at random but, when the number forms but 
a small fraction of the aggregate, it is better to divide 
the entire group to be studied into classes, ascertain 
as closely as possible the total number in each class, 
and then select samples from the various classes in the 
ratio of their respective nimibers. 

Still another modification of this plan, which is 
scientifically accurate but practically more difficult 
to apply, is to arrange the items as nearly as possible 
in order according to size and select samples at approxi- 
mately equal intervals throughout the series. This 
finds application principally in the field of biology where 
where it is used to avoid dealing with too large a number 
of iteiigis. 

Sec' 36. Selection of Enumerators. 

It is almost mmecessary to remark that the quality 
of an investigation will depend largely on the character 
of the enumerators employed. Intelligence is necessary 
in order that the vague replies of the informants may 
be eliminated and put in shape for recording. But 
intellectual capacity is far from being the sole requisite. 
Diligence and integrity are just as necessary. The 
unscrupulous enumerator will save much effort by filling 
in schedules with fictitious quantities and so vitiate the 
entire result. Those persons directly interested in the 
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outcome are likely, of course, to allow their personal 
bias to enter into their records to a greater or lesser 
degree. In addition, the enumerator should be cour- 
teous and tactful in order that the work may proceed 
smoothly and the correct replies be elicited wherever 

possible. 

REFERENCES. 

Meitzen, August. Statistics, pp. 120-129, 155-185. 
BowLEY, A. L. Elements of StfUistics, Chaps. II and III. 
Yule, G. U. Introduction to StfUistics, Chaps. XIII and XIV. 
Block, Maurice. Traiti de Statistiqii£f Chaps. IX and X. 
Berttllon, Jacques. Cours EUmentaire de Statistique^ Chaps. 

IV and V. 
BowLBY, A. L. Elementary Manual of StatisticSj Chap. VIII. 



CHAPTER VII. 

THE COLLECTION OF MATERLAX. 

Sec. 37. The Secondary Method. 

When the mvestigation is to be of a secondary type, 
it is necessary to exercise considerable care in several 
respects, before making use of the figures gathered by 
others. The first essential is to know something of 
the reliability of the original compiler of the data and 
his ability to get at the facts. They may represent 
mere guesses and, in that case, it is folly to use them 
as a basis for scientific work. If one is satisfied that the 
tables have some real claim to merit he should next 
proceed to determine the following facts as completely 
as possible. 

1. From what sources the figures have been derived. 

2. The definitions of the units, including the instruc- 
tions to the enimierators. 

3. The purpose for which the data were originally 
collected. 

4. The methods used in collecting the same. 

5. The degree of accuracy of the figures. 

These points having been satisfactorily settled, the 

investigator is now able to make use of the figures in an 

intelligent manner. 

Frequently, considerable discrepancies will be found 

in the figures in different parts of the same report. 

61 
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These are often due to the omission of certain parts 
from one total which have been included in another 
similar one. Often a little careful reasoning and ob- 
servation will locate the source of the error. 

Estimates or figures for the same amount taken from 
different sources often differ widely. In this case it 
usually requires much greater effort to reconcile the 
diverging numbers. If both are from apparently reli- 
able sources, it may be worth the effort, otherwise all but 
the best authenticated must be rejected or else the 
estimate be only made accurate to the furthest digit 
in which the different numbers coincide. 

Sec. 38. The Primary Method. 

After the schedules have been returned by the in- 
formant or enumerators, as the case may be, there is 
still, as a rule, much work to be done before the results 
are ready for tabulation. Each schedule must be 
checked over for errors and omissions. Sometimes the 
latter may be supplied by a second inquiry but this 
entails a very considerable amoimt of extra labor and' 
expense and, ofttimes, surprisingly little additional 
information is elicited, most of the original blanks being 
due not to oversights but to some difficulty in answering 
the question. 

If the figures are manifestly erroneous, they must be 
either corrected or rejected. Frequently an entire 
schedule will be found so incomplete or so badly tangled 
that it must likewise be thrown out. It is better, by 
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far, to have a smaller number of correct samples than 

to have a large nimaber of incorrect ones. In the first 

case, the error can often be mathematically corrected 

with approximate accuracy; in the latter case, there is 

no remedy. 
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CHAPTER VIII. 

APPROXIMATION AND ACCURACY. 

Sec. 39. Perfect Accuracy Rarely Attainable. 

As was mentioned in the first chapter, statistics as a 
science deals with estimates rather than with exact 
enimierations. If we wish to measure the total product 
of the coal mines of the United States, it is self-evident 
that the result can only be approximated. Not a single 
carload can be weighed with exactitude. There are 
likely to be errors in the number of car-loads reported 
from the various mines. In addition, a large number of 
small mines are sure to be omitted from the list. Tak- 
ing all these facts into consideration, one readily sees 
that the total may be in error many thousands or 
perhaps even several milUons of tons. In dealing with 
questions of error, however, relative and not absolute 
accuracy is the standard in mind. The production of 
coal for the United States in 1909 was over 397 million 
tons, hence, an error of one miUion tons only amounts 
to about one fourth qf one per cent. For many pur- 
poses, an error of even four or five per cent, might not 
seriously vitiate the result. 

Absolute accuracy is not possible in any case in which 

measurement is involved. By means of an ordinary 

ruler, one might measure the length of a needle in 

whole millimeters. By substituting a simple vernier, 

64 
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the accuracy could readily be increased to tenths of a 
millimeter and, by still more refined processes; the error 
might be reduced to the thousandth part of a millimeter 
but still the measurement would never absolutely accord 
with the length of the needle. More refined methods 
simply give a constantly closer approach to exactitude, 
but never attain it. 

Sec. 40. Standard of Accuracy. 

While, in the physical sciences, very great accuracy 
of measurements is practicable, this is far from being 
true in the case of social phenomena. In this field 
a multitude of sources of error are ever present, many of 
which can be eliminated by no degree of care. For- 
tunately for the statistician, small errors are often 
negUgible and in no way obstruct the solution of the 
given problem. Attempts to obtain the greatest pos- 
sible degree of accuracy are, frequently, merely wastes 
of time. It might be possible to measure the customs 
revenue of the United States to the nearest cent but, 
for ordinary purposes of statistical comparison, such 
accuracy is not only superfluous but positively confusing 
to the mind inasmuch as the addition of extra figures 
diverts the attention of the mind from the fundamental 
digits. 

For every statistical problem, there should be deter- 
mined, in advance, a definite standard of accuracy for 
each item and every endeavor should be made to bring 

each recorded instance up to this standard but this 
6 
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Standard by no means needs to correspond to the 
highest degree of accuracy attainable. In taking an 
age census, it would probably be possible to determine 
the age of most persons to the nearest day but there 
would be no advantage in so doing. Hence, the only 
desideratmn is to obtain data sufficiently accurate for 
the purposes for which they are intended or are likely 
to be used. 

Thus, in the case of a large lake, the area in square 
miles would be accurate enough for all purposes but, 
in the case of a small reservoir, measurements in cubic 
feet or gallons would be appropriate. The nearest mile 
sufficiently approximates the length of a railroad but 
light waves must be measured to milUonths of a milli- 
meter. To repeat, relative and not absolute accuracy 
is the desideratum. 

Sec. 41. Round Numbers. 

Since numerous digits are confusing to the mind, it is 
frequently best to express quantities in round numbers 
even where the exact figures are available. For ex- 
ample, if one wishes to express the comparative popula- 
tions of the United States and China to the ordinary 
audience, it is far better to state the population of the 
United States as ninety millions and that of China as 
four hundred millions than to give the census figures 
for each nation, for the hearers, in trying to sense the 
digits, fail to comprehend the main point which the 
lecturer is trying to convey. 
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This same use of round numbers is allowable in 
popular books and magazine articles and for some 
purposes in scientific works. If one wishes to show the 
comparative amount of steel produced -in the United 
States during the last thirty years, it is perfectly correct 
to tabulate it as follows even if accuracy to a further 
degree is possible. 

steel Prodaoed in 
Hundreds of Thousands 
Tear. of Tons. 

1880 12 

1885 17 

1890 43 

1895 61 

1900 102 

1905 200 

The remaining figures should in no case be given if 
they are statistically inaccurate. If, however, they are 
accurate it is often advisable to state the complete 
numbers in a table so that someone else may use the 
figures in other combinations where greater accuracy 
may be desirable. This being done, round numbers 
may be used in comparisons in the text and the principal 
points brought, in this manner, to the reader's attention. 
This is probably the method most generally applicable 
in scholarly works. 

Sec. 42. Possible Accuracy. 

While it is very easy to determine the desirable 
standard of accuracy, it is by no means possible to 
always bring every item within this limit. The engineer 
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on a geodetic survey may desire to read his angles to 
tenths of a second but his instrument may be such that 
seconds only may be correctly ascertained. The col- 
lector of wage statistics probably desires to know the 
yearly wage of each individual to the nearest dollar but, 
in cases of irregular employment, the chances of getting 
nearer to the correct simi than ten or twenty dollars 
are slight indeed. The geologist would fain measure 
the date of the beginning of the last glacial recession to 
the nearest century but he must be content to approxi- 
mate it in tens of thousands of years. The legislative 
commission is anxious to determine the exact amoimt of 
highway expenditures of the state, but many of the 
reports sent in are so incomplete and confused that 
the closest estimate must probably differ by many 
thousands of dollars from the correct figure. In each 
of these instances, the standard of accuracy is not set 
by the statistician but by circumstances over which he 
has little or no control. In such cases, he should be 
careful that his report shows accuracy only to the point 
actually attainable and not to the point desired or to 
the degree indicated by the most exact of his data. 

Sec. 43. Accuracy in Entering and Reading Figures. 

The accuracy to which figures are read or are correct 
should, in tabulation, be stated in the heading of the 
column or in a footnote. All inaccurate figures except 
the first digit beyond the margin of accuracy should be 
dropped. 
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It is frequently better to carry the digits one place 
further than absolute accuracy justifies for the first 
inaccurate digit often represents an estimate which is 
somewhere near the correct quantity. In such cases, 
its elimination increases the error of the final result. 
For instance, if the length of a leaf is recorded as 2.96 
cm., though it was only possible to read accurately to 
the nearest tenth of a centimeter, it is better to retain 
the final digit 6, since 2.96 is Ukely to be a closer ap- 
proximation to the real length than 3.0 which would be 
the reading if this digit were dropped. If the last cor- 
rect figure is a cipher, it must invariably be entered 
the same as any other figure. Thus, if, in measuring 
the length of a leaf, one is reading correctly to milli- 
meters but expressing the result in centimeters, and the 
leaf happens to be as nearly seven centimeters long 
as can be measured, it must be expressed as 7.0 cm., 
not merely as 7 cm. The latter figure indicates that the 
reading is only accurate to centimeters. In other 
words, any leaf between 6.5 and 7.5 cm. in length would 
in that case be entered at 7 cm. while, if the accuracy 
of reading is to millimeters, and the entry is 7.0 cm. 
it means that the length is between 6.95 and 7.05 cm. 
An entry of 7.00 cm. would similarly show that the 
reading was accurate to hundredths of a centimeter and 
that the leaf length was between 6.995 and 7.005 cm. 
. When, for any reason, it becomes necessary to drop 
certain digits of a number in order to bring all items 
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to a uniform standard or because certain digits are 

beyond the limit of accuracy, one should always be 
careful to see that the remaining digits are correct. For 
instance, if it is desired to reduce the following to a 



the results would be as 


follows: 




Original Nnmber. 




Correct to One 
Decimal Place. 


27.25001 




27.3 


27.249987 




27.2 


18.20995 




18.2 


18.9478 




18.9 


18.95172 




19.0 


19.09162 




19.1 


24.05002 




24.1 


23.04997 




23.0 



All fractions over half are, in every instance, counted 
as whole numbers and all under half are discarded. 
Those exactly equalling one half may be retained or 
dropped at discretion. 

Sec. 44. Possible Accuracy as a Result of Various 
Mathematical Operations. 

An exceedingly common error is to give to figures a 
large degree of fictitious accuracy which arises simply 
from some mathematical operation. Take the follow- 
ing example: John is seven years old, Harry nine and 
George is six. Find the average age of the boys. The 
student is likely to proceed in this fashion. 

7 + 9 + 6 = 22, 
22 4- 3 = 7.333333333 yrs. old. 
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Evidently, the number of decimal places is limited 
only by the student's industry or his sheet of paper. 
The answer will, oftentimes, be accepted as correct yet 
a moment's reflection must demonstrate its absurdity. 
John's age is only stated in even years. If given cor- 
rectly, he may lack five months of being seven years 
of age or he may be seven years, five months and 
twenty-nine days old. The same is true of the ages 
of the other two boys. Not one of the items, then, is 
given with greater accuracy than to the nearest year 
and the average could not possibly attain great exact- 
ness, yet, our answer purports to state the average 
age to the billionth part- of a year. One must guard 
against such fictitious accuracy whenever numbers con- 
taining decimals, or giving a decimal as the result, are 
multiplied, divided, raised to a power, or the root ex- 
tracted. The following discussion may be helpful in 
determining the accuracy of results obtained through 
mathematical operations. 

Accuracy in Multiplication. 
If m = the multiplier. 

n = the multiplicand, 

X = the possible error of the multiplier. 

y — the possible error of the multiplicand. 
Then 

(m + x) (n -\- y) = mn + fny + nx -\- xy, 

(m — x) (n — y) = mn — my — nx -{■ xy. 
The product evidently then is equal to 

mn -[• xy =^ (my + nx) 
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But xy \a BO small compared to the quantity my + nx that 
it may ordinarily be neglected and the product be considered 
simply as mn ^ (my + nx). 

The error therefore (my + nx) can be readily calculated and 
the accuracy of the product, and the number of correct digits 
therein determined. 

Example (accurate digits italicized) : 

Possible error of first factor is 0.5 =« x, 
Possible error of second factor is 50 » y. 

m = 726. 
n = 10,200. 

The product, however, equals mn -\- xy ^ {my + rtx). Substi- 
tuting 

7,405,200 + 25 :*: (36,300 + 5,100) = 7,405,225 =fc 41,400. 

li xy \& neglected as it may well be owing to its small size, we 
get as the product 7,405,200 * 41,400 which equals 7,446,600 
or 7,363,800. 

The greatest absolute accuracy is therefore 7,^00,000. 

AcCTjRACT IN Division. 

If a — the dividend. 

d = the divisor. 

X = the possible error of the dividend. 

y — the possible error of the divisor. 
The quotient evidently lies between 

a -\- X J a — X 

J ana %- -. — . 

d — y d + y 

But 

a + X __ g — a? _ (a -}-x){d-\-y) — (a^ x) (d — y) __ 2dx -f- 2qy 

d-y d + 2/~ d^-y^ " d* - 2/« ' 

The possible error is evidently nearly equal to half of this 

... dx + ay 
quantity or ^^r^ . 
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Therefore, the quotient approximately equals -r ^ ~ mZ.a ' 

By determining the possible error, the number of absolutely 
accurate digits may be determined. 
Example (accurate digits italicized): 



In this case. 



IMO -^ .012 

a = 1440 
d = .012 

y = .0005 

The quotient equals 

a ^ dx+ay ^ 1,440 ^ .012 X 5 + 1,440 X .0005 
d — d* - 2/2 012 .000144 - .000,000,25 

7R 
- 12^'^ " :008li375 = ^20,000 - 5,426 + 

^ 125,426 + or 114,574-. 

Therefore, the result is strictly accurate only to hundreds of 
thousands but with a strong probability of accuracy to tens of 
thousands place since the possible error is 5,426 + and the 
probable error much less. 

Accuracy in Square-root. 

If n = the nimiber whose root is to be extracted, 
e — the possible error of the nimiber. 

Then the correct root evidently lies between ^n ■\- e and 
Vn — e and the possible error is approximately Vn — Vn — e. 

This quantity is but a fraction of "Whence the error is greatly 
reduced by extracting the root. 

Illustration (accurate figures italicized) : VT^T^OO lies between 
Vl4,450 and Vl4,350. The possible error is Vl4,400 - Vl4,350. 
This equals 120.0 - 119.79 -f = 0.21 -. But the possible error 
of the original number equals 50 and the square root of 50 is 7 +. 
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Hence, the possible error of the square root is much less than the 
square root of the possible error of the original number. 

ACCURACT OP A SqUARIL 

lin — the number to be squared, 
e — the possible error. 

Then, the correct square must be between (n + c)* and 
(n - e)K 

But (n + e)» = n» + e' + 2n«, and (n - e)» = n' + 6» - 2ne. 

Evidently, the square of the quantity then equals n' + 6' =*= 2ne. 
For most purposes, e* is so small as to be negligible and the result 
may be stated approximately as n' ^ 2ne and the number of 
correct <ligits be thus readily ascertained. 

Example (accurate digits italicized) : 

{IJSQOy = (1,200)« :*: 2 X 1,200 X 50 = 1,440,000 ^ 120,000. 

Therefore, the possible limits of the square are ap- 
proximately 1,560,000 and 1,320,000, the result being 
accurate only to millions place. 

It should be noted in each of the above cases that the 
possible errof by no means corresponds with the prob- 
able error. The chances that the error will be the 
greatest possible are comparatively slight,^ hence, as 
stated in Sec. 43, in entering products, quotients, or 
roots, one more digit should always be added than 
could be done if absolute accuracy to the last digit were 
required. By entering an extra digit, the probable error 
of the result is greatly diminished. 

^ For a discussion of the theory of error see Bowley, A. L., 
Elements of Statistics, page 26d f . 
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Sec. 45. Compensating vs. Cumulative Errors. 

The accuracy of the final results depends, very 
largely, on whether the errors involved are of the com- 
pensating or cumulative type. If different people were 
to estimate the length of a given line, the chances 
would be that as many would estimate it too long as 
too short. The errors in measuring a Une made by a 
pair of chainmen because of stretching the chain too 
tight or not taking up the slack sufficiently would tend, 
in the long run, to offset each other. The estimates of a 
thousand observers as to crop conditions compared to 
the previous year, while in no case accurate, would, 
taken together, tend to quite closely approximate the 
correct result. These are simply concrete applications 
of the law of statistical regularity. 

On the other hand, if the chain used by the above- 
mentioned surveyors was too short, the longer the line 
measured, the greater the error would become. If 
certain reports of expenditures are missing, the large 
number of items present will in no way tend to offset 
those omitted. If women are prone to state their ages 
too low, the matter will not be remedied beqause millions 
of the sex are counted. The logic of trying to correct 
cumulative errors by mass of data is illustrated by the 
-pun of the wag who remarked that a certain restaurant 
keeper was losing a little money on each meal but 
made it up because he had so many patrons. We may 
say then, in conclusion, that when the number of items 
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is large, compensatiiig errorsy if relatively smalli are 
negligible but, on the other hand, cumulative errors 
always seriously affect tiie accuracy of tiie total or the 
average. 

Sec. 46. Accuracy of Totals. 

The strength of a chain is determined by its weakest 
lisk. Similarly, the total can be no more accurate than 
its most faulty item. No amount of compensation can 
overcome this fact. If a hmidred railroad companies 
report their expenditures to the nearest cent and a 
single line gives its figures only to the nearest thousand 
dollars, it is impossible to state the total expenditure 
for railroad companies closer than in thousands of 
dollars. If this one company reports that it spends 
$2,673,000, we only know that the amount spent was 
somewhere between $2,672,500 and $2,673,500. Any 
amount between these figures would be correct accord- 
ing to the statement of the company. If the sum of 
the figures reported by the other companies were 
$16,295,472.16, we cannot legitimately add thereto 
$2,673,000 and then state the total in dollars and cents. 
We known that this answer is likely to be $500 in 
error either way. Hence, the correct form of the 
operation would be as follows: 

$16,295,472.16 

2,673,000 approximately. 

$18,968,000 correct approximation. 
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However, in obtaining the sum of several amounts 
one must not make the opposite mistake of dropping 
all digits beyond the point of accuracy of the least 
accurate item. For example, in adding the following 
column, the correct result is not 46, as one might 
suppose at first thought, but should be stated as below. ' 

6.321 • 

2.4926 
21.4632 

8. 

7.3875 

2.426 
48.0903 summation. 

48 correct approximation. 

To have omitted the decimals, would have introduced 
an unnecessary error of two whole units into the total. 
Therefore, all correct figures in the separate items 
should be retained and the rejection of the inaccurate 
digits be made only in the total. 

Sec. 47. Accuracy of Averages. 

We have seen that the absolute accuracy of a total 
can be no greater than that of the most inaccurate 
item composing it. Wq shall now consider the absolute 
accuracy of an arithmetic average 

If mi, m2, ms, etc., are the estimated quantities, n in 
number, their respective errors being 6i, €2, 63, etc., 
then the estimated average is Xm/riy but the largest 
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possible average is 

(mi + ei) + (mt + ea) + (m> + e») - -- + (m^ + Cn ) 

n 

Sm + 26 Sm , Se 

or or 1 — . 

n n n 

Therefore, the average may be correctly written 
2w/n ^ Se/n. But Se/n would be the average possible 
error of a single item. Hencei the possible error of 
an arithmetic average is equal to the average possible 
error of the items in the series. 

In obtaining this possible error, we assumed that 
all the errors were in a similar direction. As a matter 
of fact, however, the chances that this will be true 
are very remote when the number of items is large 
and the errors are of the compensating type. The 
probable error of the arithmetic average is therefore 
only a fraction of the possible error. HE — the 
possible error of the arithmetic average, the probable 
error of the same is approximately E/ -^.^ 

This fact that the probable error of the average of 
a niunber of items is less than that of any single item is 
of great value in scientific work. The physicist 
makes a large number of observations of the same 
phenomenon and averages the results. The sur- 
veyor repeats the measurement many times in deter- 
mining an angle accurately. The sociologist obtains 
observations in many different localities in order that 

^ For proof see Bowley, Elements of Statistics, pp. 303-315. 
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peculiar conditions in one place may be offset by reverse 
surroundings in another. Even the personal bias 
of the observers tends to be eliminated by the aver- 
a^ng process. 

Still, one must not go to the opposite extreme and 
conclude that an average always reduces the error 
to a negligible quantity. Biased errors are not mutually 
corrective and hence are in no way reduced in the 
average and even unbiased errors may be large enough 
to greatly vitiate the average. When the items are 
few in niunber, the effect of errors is especially serious. 

Sec. 48. Locating the Decimal Point. 

In performing mathematical operations with a 
slide rule, one obtains merely a sequence of figures 
and the beginner is often puzzled as to the correct 
location of the decimal point in the result. One 
should learn to determine this correctly by inspection 
but, until this is accomplished, it may be properly 
placed by means of the following empirical rules. 

Multiplication. 

1. Consider the first significant digit in the multi- 
plier as a unit and the, remaining figure^ as decimals 
of the same. Do the same for the multiplicand. 

2. Obtain a mental product of these two numbers 
and note whether it contains one or two integral digits. 

3. Add to the number of integral digits thus obtained 
the total number of integral digits not used in the pre- 
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liminary multiplication contained in the multiplier and 
multiplicand taken together and, in case either the 
multiplier, multiplicand, or both should be wholly 
decimal, subtract from the sum thus obtained the 
sum of all ciphers preceding the significant digits in 
the multiplier and multiplicand and the decimal points 
in each entirely fractional factor. The result, if posi- 
tive, indicates the number of integral digits in the 
product; if negative, the number of ciphers preceding 
the first significant digit of the product. 

Example I. 

42,000 X .025 gives 105 as the sequence of figures. 

4.2 X 2.5 = 10 + or two digits. 

The multiplicand contains four other integral digits. The multi- 
plier contains a decimal point and one cipher. 

2 + 4 — 2 = 4, the number of integral digits. 

The product, therefore, equals 1,050. 

Example II. 

.036 X .024 gives 864 as the sequence of figures. 

3.6 X 2.4 = 8 + or one digit. 

The factors contain two decimal points and two initial ciphers or 
four digits. 

1 ~ 4 ss — 3 or the number of initial ciphers in the product. 

Hence, the product equab .000864. 

Division. 

In division, as well as in multiplication, it is usually 
possible to locate the decimal point in the result by 
inspection, but, in case there is diflSculty in so doing, 
the following rules and table may prove of assistance. 
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Considering digits to the right of the decimal point 
as negative and to the left as positive, count from the 
decimal point to the first significant figure in the divi- 
dend and record the result. Repeat for the divisor. 
From the number determined for the dividend, subtract, 
algebraically, the number found for the divisor and add 
to the remainder the nimiber found under the proper 
headings in the table below. 

Let D =» first significant figure of dividend. 
d = first significant figure of divisor. 

TABLE II. 
Determination of Decimal Point in Division. 



Characteristics of Diyidend and 


Diyidend Equal to 

or Larger than 

Divisor. 


Diyidend Smaller 
than Divisor. 


Divisor. 












2) = dor 
D>d 


D<d 


2> = dor 
J>>d 


D<d 




r Divisor 










Dividend 


unity or 
greater 


+1 








-1 


unity or 
greater. 


Divisor 
wholly a 
decimal 

Divisor 





-1 






Dividend 
wholly a - 
decimal. 


tmity or 
greater 

Divisor 









-1 




wholly a 
decimal 


+1 








-1 



Examples: 

1. .002 -^ .04 

In this case we obtain for the number of the first significant 
7 
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digit in the dividend — 3 and for the number of the first signifi- 
cant digit in the divisor - 2. But (- 3) - (- 2) = - 1. The 
dividend is a decimal, the divisor a decimal, the dividend is 
smaller than the divisor, and i> < d, hence, by the table, we add 

(-1), ' 

(-1) + (-!) = -2. 

Therefore, the quotient = .05. 

2. 1,000 -^ .025. 

In obtaining the number of the first digit, 4 — (— 2) « 6. 
From the table, add (—1). 

6 + (- 1) = 5. 
Therefore, the quotient is 40,000. 

Squares. 
Follow the same rules as for multiplication. 

Square Roots. 

Divide the given number into periods of two figures each, 
coimting each way from the decimal point. Now count the 
number of periods from the decimal point to the first significant 
digit. This will give the number of the first significant digit 
of the root. 
Examples: 

1 >l36W. 

Counting periods to the first significant digit gives + 2. The 
root, then, is 60. 

2. V.00'00'81. 

The number of the first significant digit is —3. The root, then, 

is .009. 
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PART III. 

ANALYSIS OF THE MATERIAL COLLECTED. 



CHAPTER IX. 
TABULATION. 

Sec. 49. General Rules. 

At first thought, it would seem the simplest thing 
in the world to construct a table, but the beginner who 
attempts to tabulate a complex group of figures will 
quickly discover that the simpUcity of the operation is 
far more apparent than real. In fact, when a scientific 
tabulation has once been made, it is often found that a 
large share of the work of analysis is completed. This 
is so far true that, until quite recent years, statisticians 
looked upon the table as the "ultima thule" of their 
efforts. 

In beginning a tabulation, the first question that 
arises is whether the figures should be grouped in 
one or several tables. A single table has the merit of 
compactness and the data are thus brought into 
proximity. The table, however, if too large, becomes 
confusing to the eye and there is great difficulty in 
following the lines and columns at a glance. This 
difficulty may be partially obvistted by varied modes of 
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ruling and spacing but, in general, it is better, where 
practicable, to break the table up into several separate 
sections. 

Each table should be a unit. Rarely, indeed, should 
one attempt to demonstrate in the same table several 
comparisons of different natures. For each distinct 
purpose, there is usually one tabular form which is 
best suited to bring out the point desired. If another 
topic is included, the result is that each result is ob- 
scured. For example, it would ordinarily be imwise to 
attempt to show statistics of wages and imemployment 
in the same table for the groups which would best 
illustrate earning capacity mi^t differ considerably 
from those which would bring out the characteristics 
of imemployment. 

Another matter to be decided upon is whether the 
table shall show absolute figures, percentages, or both. 
This depends upon the kind of comparisons to be made. 
If one wishes to compare the wheat crops of various 
nations, it is manifestly useless to reduce the amoimts 
to percentages of the world's crop for the percentages 
would be in the same ratio as the absolute figures but, 
if the object is to compare the amount of insanity 
among city and country dwellers, the actual numbers 
of insane in each place would tell us nothing and only 
when reduced to ratios or percentages does any meaning 
appear. In a reference work or in original investiga- 
tions, the absolute figures should always accompany the 
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percentages so that they will be available for other 
studies of a diflferent nature. Percentages or ratios, 
as well as totals and averages, are essential features 
for the majority of tables. 

The number of separate headings or columns to be 
used is a third query which must be answered. The 
more minute the subdivisions, the greater the accuracy 
attainable. On the other hand, a multiplicity of head- 
ings prevents the proper emphasis being given to the 
main facts and tendencies shown by the statistics. 
The exact number of divisions is something that de- 
pends on the specific data in each case and must be left 
to the judgment of the statistician. In general, it is 
more satisfactory to use a few main divisions with 
several subheadings under each than to have a large 
number of coordinate titles. It is possible, by using 
this method, to enter total or percentage columns for 
each of the main divisions thus bringing out distinctly 
the principal ideas while still reserving the minor 
columns for details. If the tables are large so that 
these total or percentage columns fall too far apart 
for easy comparison, it is best to enter these together 
in a separate summary table so that the eye can take 
in the general results at a glance. In a sjmoptical 
table of this sort, it is often preferable to simply state 
the results in round numbers, since for the reasons 
mentioned in Sec. 42, a large number of digits tends to 
confuse the mind and prevent a proper grasp of the 



86 ELEMENTS OF STATISTICAL METHOD. 

meaning of the "figures. The column may, for example, 
be entitled simply " Expenditures in tens of thousands" 
or " Product in millions of bu.," thus eliminating several 
superfluous digits. It is rarely advisable to drop the 
final digits in the primary tabulation and it should not 
be done in the smnmary tables if they are likely to be 
used primarily for reference rather than for the purpose 
of drawing general conclusions. 

Sec. 50. The Title of the Table. 

The title of the main table as well as of each of the 
subheadings should be complete and self-explanatory 
for the reader will seldom care to take the trouble to 
hunt up references in the text or footnotes in order to 
learn the significance of certain headings. A common 
error which should, of course, be avoided, is to make 
the title of too small scope to cover all the data in the 
table. Another frequent mistake, which is more seri- 
ous is to have a title which is so indefinite as to permit 
of a double meaning, as, e. g., "Percentages Engaged 
in Various Occupations, by Nationalities" may mean 
either the percentage of workers in a given occupation 
belonging to each nationality or, on the other hand, 
the percentage of the workers of each nationality 
engaged in the given occupation. 

The column headings should, where measurements 
are included, invariably state the unit used, as "Height 
in inches," "Price in dollars," etc. 

Titles should always be in Roman characters rather 
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than in script, the typing of each heading should corre- 
spond in size and prominence to its respective impor- 
tance, and all coordinate headings should appear in 
type of like size and style. The second requirement 
prevents typewritten tables from being satisfactory if 
the headings are at all complex. When there are a 
large number of coordinate headings, some systematic 
sequence must be determined upon. For instance, the 
states of the Union might be placed in order of geo- 
graphical location, area, population, alphabetically 
according to their names, or with some other logical 
criterion in mind. 

Sec. 51. Form of the Table. 

The table should always be roughly drafted in its 
complete form before any of the ruling of the permanent 
table is begun or any figures are entered. This is 
imperative in order that the table may be adjusted to 
the size of the sheet, the proper width of the columns 
be calculated and the correct arrangement of the head- 
ings be provided for. Space must usually be allowed 
for percentages or averages as well as for totals. 

All numbers which are to be compared must be placed 
close together and, wherever possible, they should be 
placed in the same vertical column rather than the same 

m 

horizontal line. Columns which are intended for com- 
parison should be placed as close to each other as 
possible. Totals, averages, and percentages should 
invariably be placed adjacent to each other. Since 
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these are usually the parts of the table in which the 
general reader is primarily interested, their normal 
XX>sition is at the top or in the left-hand colimMis of 
the table. Custom has placed them at the close of the 
table instead, but the U. S. Census Bureau has followed 
the new i)olicy and its example seems worthy of imita- 
tion. Unimportant totals, used only as a check, should 
be relegated to the customary final i)osition. 

The rulings in tables should, like the headings, indi- 
cate the importance of the various subdivisions. The 
principal groups should be separated by heavy or 
multiple-ruled lines and the breadth of the lines should 
decrease as the subdivisions become lower and lower 
in rank. In tables of any considerable size, the figures 
should cross the paper in horizontal bars five to eight 
Unes in depth with narrow blank spaces between the 
bars.^ In this way, the eye is saved much difficulty 
in locating a desired number. 

Since it is often impossible to provide sufficient 
columns to specify all the different types of data enu- 
merated, the odds and ends are usually placed together 
in a "Miscellaneous" group. This saves room and 
also avoids large blank spaces in the table, which are 
undesirable, since they confuse the eye in its effort to 
follow columns or horizontal lines. Exceptional items 
should be marked with an asterisk or number referring 
to an explanatory note, similarly marked, at the foot 

* See Table VII, accompanying Sec. 62. 
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of the page. If the exceptions are too numerous, or 
if the "Miscellaneous" group is too large, the value of 
the table is likely to be seriously affected since the 
results are either rendered incomplete or lack homo- 
geneity. 

Sec. 52. Accuracy in Tabulation, 

Care should be taken to have every item in a table 
accurate, for the discovery of a few errors is sure to 
throw doubt on the merit of the work as a whole. To 
obtain accuracy, a regular system of checks is necessary. 
In the first place, each item should be gone over to see 
that the original entries are correct. This having 
been decided in the aflBirmative, the rest of the operation 
is a mechanical process. To check the totals, it is 
often feasible to add the items both in vertical columns 
and in horizontal lines. These partial totals should 
then be summated and, if the same grand total is 
obtained in each case, the additions are almost certainly 
correct. In cases in which this method is inapplicable, 
each column should be re-added. 

Percentages may be checked by adding together to 
see if their sum equals one hundred per cent. Averages 
may be multiplied by the number of items and the 
product compared with the total. 

Multiplications and divisions should ordinarily be 
performed twice, preferably by two persons. By using 
such means as this, the chances of error may be reduced 
to a minimum. 
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Sec. 53. Analysis of Results. 

The results disclosed by a tabulation are seldom fully 
revealed at a glance. Much is therefore added to the 
value of a table if it is accompanied by a written analysis 
which points out the principal conclusions which may 
be deduced therefrom, the possible errors involved, and 
the probable causes of the phenomena. The power to 
analyze a table, interpret the results ' correctly and 
state the conclusions lucidly and succinctly is one of 
the characteristics indispensable in a good statistician. 
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CHAPTER X. 

SIMPLE DIAGRAMS. 

Sec. 54. Use of Diagrams. 

Figures, at best, are not easy things for the mind to 
grasp and hold long enough for purposes of comparison. 
When read to an audience, they become practically 
meaningless. In a book, their essential indications can 
only be ascertained by careful scrutiny. One of the 
chief aims of statistical science is to render the meaning 
of masses of figures clear and comprehensible at a glance. 
To attain this end, many devices have been invented to 
supplement or explain the table, of these, graphic 
illustrations being the most common. This chapter is 
devoted to an explanation of some of the simpler means 
used for this purpose. 

Sec. 55. Cartograms. 

Many phenomena which vary with geographic loca- 
tion are best illustrated by means of cartograms or 
statistical maps. Several varieties of maps may be 
used, each of which has its special merits. If only a 
single map is to be made or if printing costs are not 
prohibitive, various colors and shades may be used with 
great success. This plan is best exemplified by the 
'^ Statistical Atlas of the United States" published by 

the Census Bureau. Here we find a large collection of 
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maps, attractive in appearance and easily interpreted. 

Since color printing is rather expensive, the same 
ends can frequently be attained at a less cost by using 
various modes of barring or cross-hatch work to indicate 
the varying degrees of density to be recorded. The 
rainfall maps printed in the newspapers are good ex- 
amples of this class. 

A third variety of cartograms, which has shown ex- 
ceptional merit for certain purposes, is the dotted map. 
If one wishes to indicate the wheat production of the 
United States, he simply places a dot for every himdred 
thousand bushels raised in a certain section. The 
amount of product indicated by a single dot should be 
such that the dots will be placed quite close together 
in the regions of greatest density. Professor H. C. 
Taylor, of the University of Wisconsin, has used this 
method extensively in his studies in Agricultural 
Economics. 

Sec. 56. Pictograms. 

The baker who was trying to impress the public 
with the small size loaf given by his rival. Smith, 
advertised for some time that he sold a sixteen ounce 
loaf while Smithes weighed but twelve. The advertise- 
ments had little effect. When he inserted the picto- 
gram shown in Fig. 1 the response was instantaneous. 
This illustrates the importance of graphic methods for 
illustration. A great variety of devices are used, thfe 
commonest and simplest probably being the bar dia- 
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gram. The size of the number is, in this method, 
simply represented by the length of the bar. Fre- 

COMPARATiyE PiCTOGRAMS 





our loaf ' smith's loaf 

Fig. 1. 

quently, the composition of the nmnber is also repre- 
sented by the shading of the bar as illustrated in Fig. 2. 
This has the advantage of bringing the whole study into 
one diagram and the disadvantage that all the sections, 
except the first, are rendered rather difiicult of exact 
comparison since they do not originate or terminate on 
the same straight line. The plan may be modified by 
making a separate set of bars for each group. 
. For the illustration of more complex mathematical 
relationships, the use of bars is usually insufficient. It 
is necessary in such cases to resort to figures of two or 
three dimensions. Fig. 3 represents a comparison of 
the hourly wage, length of working day, distribution of 
expenditures of the workers, and number of workers 
employed in industries A and B. It may be said that, 
in general, as the number of factors to be compared 
increases, the accuracy of comparison of some of the 
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groups is diminished. In the illustration, while it is 
quite easy to compare the number of workers, the 

Bar Diagrams. 
Population of Cities by Race and Nativity. 



CITY 



B 



W////M/M/A 



y/////////^////////////. 



y////M////////////^^^^^^ 



f////////////////////^^^^^^ 



LEGEND 
Nativt White E 



Fig. 2. 



Colored 
Foreign WTvUc 



hourly wages, and the hours worked per week, both 
the absolute amoimts and the percentage of the income 
spent for the different items are not easily comparable 
because of the varying shape of the rectangles or 
parallelopipeds involved. 
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Another useful variety of pictogram is the circle 
divided into sectors as shown in Fig. 4. These sectors 
may be colored or shaded as desired. In the illustra- 



INDUSTRY A 



Block Pictogramb. 

industry b 





HOURLY WAGE 



HOURLY WAGE 



Fig. 3. 

t^on, the size of the circles represents the total cultivated 
area while the angular dimensions of the sectors show 
the relative importance of each crop in the states in- 
volved. 

For comparing relative volumes, cubes or spheres are 
frequently used and, in popular works, we often see 
pictures representing the articles in question as, for 
example, a line of ships whose sizes represent the 
merchant marines or navies of different nations or a 
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row of bales showing the comparative amounts of 
cotton produced. It should always be remembered 
that, in all figures depicting area, the dimensions must 

Circle Pictograms. 
Cultivated Area Devoted to Various Crops. 

STATE A STATE B 





Fig. 4. 

vary as the square roots of the areas represented and, 

if volumes are to be illustrated, the dimensions must 

vary as the cube roots of the contents. Failure to 

observe this rule often results in grotesque misrepre- 

sentations. 

REFERENCES. 

The Statistical Atlas of the Unites Stales (Census of 1900). 
Bailey, Wm. Bacon. Modem Social Conditions. Century Co., 

N. Y., 1906. Pp. 54-66. 
BowLEY, A. L. Elementary Manual of Statistics, Chap. V. 
Bertillon, Jacques. Cows EUmentaire de Statistique, Chap. 

XI. 
Block, Maurice. TraiU de Statistiquey Chap. XIII. 



CHAPTER XI. 

FREQUENCY TABLES AND GRAPHS. 

Sec. 57. Use of Frequency Tables. 

In the study of most groups of natural phenomena, 
we find that the different items have varying character- 
istics. Some elm trees are tall, others are short; some 
men are rich, others are poor; some cities are large, 
others are small. Things which thus vary in size may 
be spoken of as variables. In the cases mentioned 
above, the variation was between different things of the 
same kind or species. It is also commonly true that 
the same thing changes in characteristics with the 
passage of time. Trees grow taller, men become richer 
or poorer and cities become larger. This method of 
change, in which a period of time is necessarily involved, 
may be designated as historical variation. It is dis- 
cussed more fully in Chap. XV. 

In studjdng things of the same variety, the work may 
usually be facilitated by dividing the items into classes. 
The simplest mode of classification is to group all the 
instances under two headings, the determining factor 
being whether they do or do not possess a given charac- 
teristic. Thus, we may classify people as sane or insane, 
workmen as employed or idle, flowers as white or 
colored, men as short or tall. For some purposes, this 

division by dichotomy, or cutting in two, may be most 
8 97 
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satisfactory but, in many cases, the difficulty arises 
that there is no distinct dividing line. Thus, it is im- 
possible to say at just what point a man ceases to be 
short and becomes tall. It is, therefore, necessary to 
lay off arbitrary boundaries between the two classes. 
But, if classes are to be thus arbitrarily established, 
it is often much more advantageous to set up a larger 
number of them rather than two only. In practice, this 
is usually done by dividing the whole group into classes 
of equal width. Thus, if the tallest tree in a group is 
39 feet and the shortest 16 feet high and it is desired 
to divide the entire group into five classes, the boundary 
lines would preferably be fixed on the round numbers 
15, 20, 25, 30, 35, and 40 feet. These boundary lines 
are known as the class-limits and the distance between 
the two limits of any class is designated as the class- 
interval. In the instance cited above, the class-interval 
would evidently be 5 feet. A table formed by thus 
dividing a group into a nmnber of smaller more homo- 
geneous classes and indicating the number of items to 
be found in each class is known as a frequency-table. 
The number of items falling within a given class con- 
stitutes the size of that class or its frequency. 

If one is told that there are one himdred men in a 
community and that their total possessions aggregate 
one million dollars or that they possess,,on an average, 
ten thousand dollars worth of property each, he still 
knows very little about the welfare of the neighborhood. 
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Does each of the men have an equal share of the wealth 
or does one man own $990,100 and the remainder $100 
each? In studying the distribution within the com- 
munity we find use for a frequency table. The mode 
of procedure is to divide the population into compara- 
tively small classes according to the amount of their 
wealth and then compare the relative size of the various 
classes or in other words the frequency distribution. 
If a somewhat similar community of a hundred men 

TABLE III. 

Simple Frequency Table Showing Distribution of Wealth. 

(Small Clafis-interval.) 



Size of Items or Wealth per Man. 

TO 


Frequency or N o of Men in Class. 


$ a-$i,ooo 

1,000- 2,000 
2,000- 3,000 
3,000- 4,000 
4,000- 5,000 


6 

8 

10 

12 

14 


6,000- 6,000 
6,000- 7,000 
7,000- 8,000 
8,000- 9,000 
9,000-10,000 


• 10 

9 

10 

6 

2 


10,000-11,000 
11,000-12,000 


3 
1 


12,000-13,000 
13,000-14,000 
14,000-15,000 


2 

2 


15,000-16,000 
16,000^7,000 
17,000-18,000 
Above 18,000 


1 
1 
1 
3 




w = 100 



100 ELEMENTS OF STATISTICAL METHOD, 

were thus arranged, we might find results something 
like the following providing there was a fairly equal 
distribution. 

We observe here that the number of men per class, 
or the frequency, rises to a maximum (known as th^ 
mode), in the $4,000-5,000 subgroup and then gradually 
falls off, but with some irregularities. Let us see what 
the efifect will be if we increase the class-interval. 

TABLE IV. 

Simple Frequency Table Showing Distribution op Wealth. 

(Larger Class-interval.) 



Wealth per Man. 


No. of Men in Class. 
/ 


$ a-S3,ooo 

3,000- 6,000 
6,000- 9,000 
9,000-12,000 
12,000-15,000 
15,000-18,000 
Over 18,000 


• 23 
36 
25 
6 
4 
3 
3 




n = 100 



It is evident that increasing the class-interval or the 
distance between the class-limits from $1,000 to $3,000 
has increased the regularity of the rise and fall of the 
figures in the second column. To gain this regularity, 
however, we have been compelled to sacrifice, to a 
certain extent, the details of the picture. Here we 
have, then, the same old conflict between symmetry of 
general outlines and accuracy of detail. The exact 
number of classes to be used must always be left to 
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the judgment of the statistician but^ in general, the 
number should be as large as it may be made without 
sacrificing the approximate regularity of the progression. 

The inquiry that naturally next suggests itself is 
whether the regular rise and fall noted in the preceding 
tables is simply an arbitrary assumption, a character- 
istic of the distribution of wealth, or a feature conmion 
to many varieties of phenomena. A little investiga- 
tion will show the last to correspond to the facts; 

If one hundred and thirteen leaves are picked, purely 
at random, from a given tree and then arranged in order 
of their lengths, and if a line is drawn on the paper 
corresponding to the length of each leaf, the results 
will appear as shown in Fig. 5. It will be observed that, 
near the extremes, the lengths change rapidly while, 
from the fortieth to the sixtieth leaves, the lengths are 
practically constant. This means that, if placed in 
groups having a class-interval of one centimeter, those 
classes between three and five and between seven and 
thirteen centimeters in length must contain few leaves 
while the class of six to seven centimeters in length 
would include over half of all the leaves. Evidently, 
then, we find the same tendency to a rise and fall in 

the size of classes in natural as well as in economic 
phenomena. 

In an actual experiment of throwing three dice one 
hundred and ninety-six times, the following results 
were obtained: 
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Frequency Table Showing Results of Throwing Three 

Dice. 



No. of Spots (Size of Item). 


No. of Times Occurring (Frequencr). 


m 


/ . 


4 


1 


5 


4 


6 


11 


7 


10 


8 


24 


9 


22 


10 


22 


11 


32 — 


12 


17 


13 


23 


14 


9 


15 


7 


16 


7 


17 


4 


18 


3 




n = 196 



This shows us that, in a matter of pure chance, the 
rise and fall of the frequencies occurs exactly the same 
as in the case of natural phenomena, and that the 
number of spots thrown fluctuates about the mode 
which is apparently close to eleven. 

From these few examples, to which others might be 
added indefinitely, we can deduce the following law: 
Both chance and natural phenomena tend to fluctuate 
about a norm known as the mode. The large majority 
of the items are usually grouped near the mode and, 
as the distance from the mode becomes greater, the 
items become rapidly fewer in ntunber. In natural 
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phenomena, a maximiini deviation may be approxi- 
mately determined beyond which no items occur. 
This law which, as we have seen, was discovered by 

Frequexcy Line Diagram. 
Results of 196 Throws of Three Dice. 




\ 



bi 



5 10 15 20 25 

SIZE OF ITEMS 

(HO. OF SPOTS THROWN) 

Fig. 6. 

Quetelet is the basis upon which a large part of the 
science of statistics is built. In the leaf lengths, 
illustrated in Fig. 5, the mode is evidently located at 
about the fifty-sixth leaf length, since at this point the 
right-hand margin of the lines becomes most nearly 
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vertical, indicating the largest number of leaves within 
a given variation in length. We would say, therefore, 
that the modal length of this variety of leaves is about 
6.7 cm. 

Another inquiry likely to occur to the reader is 
whether an increase in the number of items would 
affect the location of the mode, that is, if five hundred 
leaves had been used, would it have changed the results. 
Experience shows that the only effect of using a larger 
number of items, provided the smaller number selected 
were fair samples, is to obtain a greater regularity in 
the variation in the sizes of the classes. This is due 
only to the fact that, the larger the number of items, 
the greater the chances of obtaining a fair sample of 
that class of objects in general. 

Sec. 58. Classification in Frequency Table. 

We have already noted the comparative merits of 
small and large class-intervals. Several points need 
also to be kept in mind. Class-intervals should all be 
equal, that is, the classes should be of imiform breadth 
and their limits should all be entered in the proper 
column whether any items occur in the class or not. 
If this is not done, errors in plotting the results are 
almost sure to be made. The size of the items of 
the class may be, indicated by a single figure, as 3 cm. 
In this case, the class contains all items between 2.5 
and 3.5 cm. When the class-interval is greater than 
one unit the titles may be entered as 3-7 cm., 8-12 cm., 
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etc. The first class, then, includes all items between 
2.5 and 7.5 cm. Since items, when originally meas- 
ured, are usually read to the nearest unit, the fore- 
going is the most generally applicable system of 
headings. 

If, however, the items are measured purposely 
for use in this frequency table, it is often practicable 
to make the classes read 3-7 cm., 7-11 cm., 11-15 cm., 
etc. Now, the limits of each class are on the even 
unit and the item 6.99 cm. in length would fall in 
the 3-7 cm. class. 

It is always advisable, where possible, to so arrange 
the classes that the mid-point of each falls on an even 
unit and not on a fraction. This facilitates multi- 
plication and other mathematical operations, in those 
cases in which the class is considered as a uniform 
whole. One should constantly bear in mind that, in 
the study of natural phenomena, the items do not 
fall on whole numbers and that the actual measure- 
ments are distributed more or less evenly through a 
group the boundaries of which are wholly artificial 
and arbitrary. Nature, itself, recognizes few sharp 
dividing lines. The classification adopted must, 
therefore, be purely arbitrary. 

Sec. 59. Continuous and Discrete Series. 

Since the size or weight of natural objects is likely 
to fall at any point whatsoever between certain limits 
and can never be determined with mathematical 
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exactitude, but is always measured by approxima- 
tion, a number of such recorded measurements con- 
stitutes a continuous series. On the other hand, 
in throwing dice, it is never possible to obtain any- 
thing but an integral number of spots. A collection 
of items of this variety constitutes a discrete or broken 
series. A record of wages paid in a factory is Lkely 
to be a distinctly discrete series, for the wage is usually 
a certain nxmiber of dollars per week, the money 
unit seldom being smaller than half a dollar. Hence 
we would find no one receiving $14.39 a week and 
gaps would appear in the scale. In the last analysis, 
every series measured in a money unit must be dis- 
crete, since the smallest money unit may always be 
considered as divisible. 

Sec. 60. Frequency Graphs for Discrete Series. 

We have found graphs very serviceable in present- 
ing other varieties of statistics to the eye and we shall 
see that they are just as useful in illustrating the 
frequency table. The simplest mode of illustration 
for a discrete series is the line or bar frequency dia- 
gram. This is especially useful in those instances 
in which the number of units in the scale of items is 
small enough so that each may be indicated by a 
separate Une or bar. This is well illustrated in Fig. 6, 
which represents the frequency of various dice throws, 
as shown in the table in Sec. 57. Since there are no 
items occurring at fractional points on the scale, a 



108 ELEMENTS OF STATISTICAL METHOD. 

smooth curve, if drawn, must not be interpreted as 
indicating the existence of frequencies to be inter- 
polated in the intervals between the integers but only " 
as an indication of the normal frequencies at each of 
the integral points. The bell-shaped curve simply 
represents the relative height to which the vertical 
lines would extend provided the mmiber of throws 
was mj&nite. This bell-shaped form, which tends to 
recur constantly in the study of natural phenomena, 
is known as the normal frequency curve or sometunes 
as the normal curve of error. In the given experi- 
ment, ten spots were thrown twenty-two times. By 
the curve, we see that, normally, both would occur 
the same number of times, the normal nxmiber of 
instances being about twenty-sbc. In the experi- 
ment, eighteen spots were thrown three times and 
three spots not at all, yet it is evident that one should 
have occurred as often as the other, since each dice 
contains a six and a one on its respective faces. The 
niunber of sixes thrown in this experiment was abnor- 
mally large and the niunber of ones abnormally small. 
The object of smoothing, then, is to eliminate acd- 
dental variations and establish normal tendencies. 

Sec. 61 . Rectangular and Smoothed Frequency Graphs 
or Histograms. 

It often happens that, in either a discrete or con- 
tinuous series, there is a great difference in size between 
the largest and smallest items and that instances occur 
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at a great number of points between the two extremes. 
In the case of a continuous series, as of the leaves con- 
sidered in Sec. 57, an item may be any one of an infinite 
niunber of lengths. In such cases, it is clearly imprac- 
ticable to place a line at each measurement at which 
one or more items occur. The data must be divided 
into classes with arbitrary dividing lines and each group 
is then treated as a whole. Were we to measure the 
height of cornstalks in a field, it would naturally be 
possible to find one at almost any assigned length 
between certain Umits. If we measured a representa- 
tive group of stalks, the results might be something 

like the following: 

TABLE VI. 

Frequency Table Showixg Heights of CoRNSTAiiKS. 



Height in Ft. (Size of Item), 
m 


No. of Stalks (Freqaencj). 


3- 4 

4- 5 

5- 6 

6- 7 
7-8 
8- 9 
9-10 


3 
7 
22 
60 
85 
32 
8 




n=217 



This table could be illustrated by the rectangular 
diagram or histogram shown in Fig. 7. This series of 
rectangles illustrates fairly accurately the relative size 
of the various classes but, since the class boundaries 
are arbitrarily chosen, different sizes or arrangements 
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of classes would give noticeably different resiilts. Evi- 
dently, if a larger number of items were measured and 
the class-intervals made narrower, the steps would 
decrease in size and gradually approach in form a 
smooth curve. The rectangular histogram is a better 
representative of the 217 cornstalks actually measured 
than any smooth curve would be, but the smooth curve 
would be much more representative of all the stalks 
in the field. Since it is the latter question in which 
we are usually interested^ our aim is to smooth the 
graph so as to approach a curve which will be as typical 
as possible of the field as a whole. 

A common method of approaching this ideal is simply 
to connect the outer extremities of the base of the 
graph with the midpoints of the tops of the rectangles 
as is shown in the dotted line a in Fig. 7. This roughly 
approximates the correct outline, but introduces one 
or two errors. 

First. Though the area included in the final figure 
and that included in the rectangular histogram should 
be identical, it will be observed that when, in the figure, 
each included triangle is numbered to correspond with 
an excluded triangle, excluded triangles 1 and 8 have 
no corresponding included areas, hence, the line a en- 
closes slightly too small an area. 

Second. If the limits of the central class had been 
fixed at 7.4 and 7.6 instead of 7 and 8 it probably would 
have towered, relatively, somewhat higher than in the 
existing form. 
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Third. It is probable that stray items would occur 
outside the limits chosen, that is^ there might be 
stunted stalks just under three feet or an occasional 
giant towering above the ten foot limit. The proba- 
biUties in this respect may best be dealt with from 
information outside of the table itself. 

Keeping these three points in mind, we can now 
proceed to draw the final smooth curve or smoothed 
histogram to show what we believe to be the actual 
distribution of heights of the cornstalks in the field. 
This curve is shown in the continuous line 6. We see 
that it has a marked resemblance to the bell-shaped 
normal curve obtained from the dice diagram. 

In practice, it is very common to omit entirely the 
construction of the rectangular histogram and simply 
plot the frequencies at the midpoints of the classes, 
getting as a result the frequency polygon a, as the first 
product. This is the easiest method but it is more 
difficult to smooth properly and, when accuracy is 
desired, it is better to proceed as has been done in this 
case. If several frequency graphs are to be plotted 
on the same sheet for purposes of comparison, either 
frequency polygons or smoothed histograms must usu- 
ally be used, since the many lines in a rectangular 
histogram are too confusing, and the vertical lines 
usually coincide. 

When it is desired to smooth a frequency polygon, 
one should remember that it is really derived from the 
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rectangular histogram and proceed accordingly. This 
will mean that the top of the curve usually overtops 
the highest point of the frequency polygon, especially 
when the classes are rather large. All sharp turns 
should be avoided, the curve should change direction 
as Uttle as possible, and, in most cases, irregular fluctua- 
tions should be smoothed out. The extent of the 
smoothing necessary or permissible must depend largely 
on the specific data involved. If the data consist of 
records of natural or chance phenomena which normally 
approach a symmetrical curve of error, smoothing may 
be freely indulged in, but if economic or sociological 
data are involved, considerable irregularities may really 
exist in the normal curve and, as a result, only the 
minor irregularities should be eliminated. When all 
the data are at hand, it must be remembered that 
smoothing brings out tendencies but obscures actual 
facts. 

Since nearly every smoothed histogram representing 
a continuous series begins with an infinitely small num- 
ber of instances and decreases again slowly to zero, it 
should begin and end on the base hne. While no abso- 
lute rule can be laid down, the curve should, normally, 
reach the base about the middle of the base of the next 
class outside that in which the extreme instances he. 

Sec. 62. Comparative Histograms. 

Histograms are very useful for comparing the struc- 
ture of two or more groups of data. Table VII and 
9 
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Fig. 8 give illustration of histograms showing the com- 
parative wages in three states, A, B, and C The 
number of wage earners in state A is comparatively 
small and the bulk of the workers receive wages be- 
tween $8 and $16 per week. In state B, two diverse 
groui)s are distinctly noticeable. This may be a state 
where many women and children are employed at low 

Abboltttb Histograms Showing CoMPAHATiyE Wages in 

States A, B, and C 




8 10 12 14 16 18 20 22 24 26 28 

VWAOE PER WEEK IN OOUARS 

Fig. 8. 



wages but the men tend to receive high wages. There 
are fewer workers receiving very high wages in state C 
than in either of the other two, but the general standing 
of wages is high. 
It will be observed that, in Fig. 8, the comparison of 
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wages is partially obscured by the difference in the 
general altitude of the curves due to the fact that one 
state has so many more workers than another. To 
eliminate this defect, it is advisable to reduce all fre- 
quencies to percentages by dividing each class by the 
total number of items as has been done in the following 
table. Thus, in state A, the nimiber of wage earners in 

TABLE VII. 

Frequency Table Showing Comparative Wages in States 

A, B, AND C. Both Absolute and Percentage 

Frequencies being Given. 



Wages per 


state A, 


Stated. 


State C. 


'S® s 




*iioi 


Hi 


^ • S 


Is,?* 


Week. 


o s,® 


-ttet 9 


o S)S 


■2 rt ® 


o s>& 


■2 Sr® 






Percer 
OfW 
Earn 


i^ 


Percer 
ofW 
Earn 


fi * u 

1^1 


Percer 
OfW 
Eara 


$ 0- 1.99 


25 


0.0 


210 


0.1 


1,114 


0.5 


2.00- 3.99 


1,460 


1.5 


4,630 


2.3 


4,986 


2.3 


4.0O- 5.99 


3,784 


3.9 


16,424 


8.1 


10,102 


4.8 


6.00- 7.99 


5,025 


5.1 


24,898 


12.3 


17,170 


8.1 


8.00- 9.99 


13,200 


13.5 


12,122 


6.0 


22,054 


10.4 


10.00-11.99 


17,420 


17.7 


8,964 


4.4 


28,402 


13.3 


12.00-13.99 


16,142 


16.5 


17,220 


8.5 


33,960 


15.9 


14.00-15.99 


13,240 


13.5 


35,116 


17.3 


34,817 


16.3 


16.00-17.99 


10,940 


11.2 


34,963 


17.2 


31,460 


14.8 


18.00-19.99 

• 


7,964 


8.1 


17,842 


8.8 


24,972 


11.7 


20.00-21.99 


4,982 


6.1 


12,240 


6.1 


3,417 


1.6 


22.00-23.99 


2,786 


2.8 


7,963 


3.9 


546 


.3 


24.00-25.99 


962 


1.0 


6,241 


3.0 






26.00-27.99 


70 


.1 


3,196 


1.5 






28.00-29.99 






971 


.5 






Total 


98,000 


100.0 


203,000 


100.0 


213,000 


100.0 



116 ELEMENTS OF STATISTICAL METHOD. 



each class is divided by the total, 98,000, giving the per- 
centages shown in the third column. The sum of the 
percentages must, of course, be 100 in each case. In 
Fig. 9, we see the results of plotting these percentages 

Percentaqe Histograms Showing Comparahye Wages in 

States A, B, and C 
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Fig. 9. 

in the fonn of histograms. By this means, all states, 
large or small, are placed on an equal footing, and the 
study is much simplified. In most instances, these 
percentage histograms give much more satisfactory 
comparisons than do those histograms using the abso- 
lute figures. 

In general, it may be said that smoothed histograms 
form one of the simplest and best of all means of com- 
paring the frequency distribution of two or more groups 
of data and they are applicable to a great variety of 
cases. 
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Sec. 63. Cumulative Frequency Tables. 

A cumulative frequency table is constructed by 
simply adding together the frequencies given in a 
simple frequency table. In this procedure, each class 
is made to include all the lower classes. If we return 
to our table of cornstalk heights and cumulate the 
frequencies, the results will be as follows: 

TABLE VIII. 

Cumulative Frequency Table Showing Heights of Corn- 
stalks. 



Height In Ft. (Size of 


No. of stalks (Fre- 


CumulatiTe Fre- 


Item). 
m 


quency). 


quency. 


3- 4 


3 


3 


4r- 5 


7 


10 


5- 6 


22 


32 


6- 7 


60 


92 


7- 8 


85 


177 


8- 9 


32 


209 


9-10 


6 


217 




n = 217 





Sec. 64. The Ogive. 

By plotting the data given in the. last column, we 
shall obtain a cumulative frequency graph or ogive. 
It will be remembered that in constructing a frequency 
polygon the frequency must be plotted at the midpoint 
of the class but, in laying out an ogive, it must always 
be plotted at the upper limit of the class instead. Fig. 
10 shows such an ogive constructed for the above table 
and then smoothed. A glance at the figure shows how 
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much more readily this is accomplished in the case of 
the ogive than in that of the histogram. This is one 
of the marked advantages of the ogive. 

Anqttlab and Smoothed Ooiye SHOwn^o Heights 

OF Cornstalks. 
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Ogives, like histograms, may be used for comparing 
groups of statistics in which time is not a factor. Just 
as in the case of histograms, better results are obtained 
if the frequencies are reduced to percentages. Ogives, 



FREQUENCY TABLES AND GRAPHS. 119 

in general, are more difficult for the ordinary person to 
interpret than histograms and are used primarily for 
the determination of medians, quartiles, percentiles, 
etc., a matter which will be discussed in a later 
chapter. 

Sec. 65. General Rules for Construction of Graphs. 

1. Rule the axes in heavy black lines. 

2. Choose a scale which will include all your items 
and at the same time fit the paper 

3. In arranging the scale, always place round numbers 
on the heavy lines of your section paper. If the paper is 
divided on the decimal plan, number your scale 5, 10, 
15 . . . ; 10, 20, 30 . . . , or in some other numbers 
which are readily divisible on the same scale as the 
paper, never using such a scale as 3, 6, 9 ... , the 
fractions of which fail to correspond to the divisions 
on the paper. Never number the scale simply to agree 
with the frequencies given in the table. 

4. Graphs should, in general, cover the main part 
of the sheet of paper used. They should be on a large 
enough scale to bring out such details as are desired, 
but a graph small enough to be taken in at a glance 
is preferable, for most purposes, to one of greater 
size. 

5. The graphs should be as accurate as convenient 
to make them but, if they illustrate sufficiently the 
points that it is desirable to bring out, great precision 
in every detail is not. essential. 
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CHAPTER XII. 
TYPES AND AVERAGES. 

Sec. 66. Uses of Types or Averages. 

Averages are used 1. To give a concise picttire of a 
large group. We could not grasp the idea well if given 
the height of every tree in a forest but the average 
height is something definite and comprehensible. 

2. To compare different groups by means of these 
simple pictures. It follows as a corollary of No. 1 that, 
before we can compare two groups, we must have a 
definite picture of each in mind; thus, two forests can 
only be compared by means of totals or averages of 
some sort. 

3. To obtain a picture of a complete group by the use 
of sample data only. It has been found, in practice, 
entirely superfluous to measure the heights of every 
person of a race to obtain the typical height of that 
people. An average obtained from a few hundred 
samples is so close to the exact average of the whole 
that the difference is negligible. 

4. To give a mathematical concept to the relationship 
between different groups. We may say that the trees 
in one forest are taller than in another but in order to 
find any definite ratio of heights it is necessary to 

resort to averages. 

121 
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I. THE MODE. 

Sec. 67. The Mode Defined. 

One of the most useful of the types or averages is the 
mode. It is variously defined as the most frequent size 
of item, the position of greatest density, and the position 
of the maTJmtim ordinate in a smoothed histogram. 
When we speak of the average man, the average income, 
etc., we usually mean the modal man or the modal 
income. We might say that the modal workingman's 
house contained five rooms, the modal contribution • 
to a church collection is five cents, meaning, in each 
instance, that this is the vogue, the most usual occur- 
rence, the common thing. 

Sec. 68. Methods of Determining the Mode. 

Once a smoothed histogram has been correctly 
constructed, the mode may usually be located, at a 
glance, by finding the size of item corresponding to the 
greatest ordinate or the highest part of the curve. It is 
perfectly possible for a histogram to have a number of 
distinct modes and, when this is true, each is located in 
the same way. 

If there is but a single well-defined mode, the class 
containing it is at once located in the frequency table, 
but, in many instances, there are numerous irregularities 
in the table, though but one mode is really existent, and 
then the modal class is not so easily selected. In such 
cases, it is best to approximately locate the mode by 
a process of grouping. The procedure is as follows: 
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First, the frequencies are grouped by twos. Then, 
the upper limit of the group is shifted one space and 
again the frequencies are grouped by twos. Next, 
the grouping is done by threes. The upper Umit is 
shifted down one space and the process repeated. 
The Umit is then shifted another space and the process 
again repeated. If necessary, the grouping is now done 
by fours. This method is continued un,til regularity 
is secured and the point of maximum frequency is not 
changed by a shift in the upper Umit of the group. 
The mode now Ues in each of the largest groups in 
the later series of groupings used. In this way, it 
can be definitely placed within certain Umits. The 
following table illustrates the process. Only the items 
near the mode are used, since the inclusion of the ex- 
tremes is manifestly useless. 

TABLE IX. 
Location of Mode by Grouping. 



Size of Item. 
m 

5 
6 

7 

8 

9 
10 
11 
12 
13 
14 
16 
16 
17 




Frequency. 
/ 



] 
} 
} 
} 



156 



182 



177 



148 






178 



174 



171 
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In the first grouping, the mode apears to be either 
13 or 14, since 123 is the maximum sum of frequencies 
obtained. In all the later groupings, however, the 
maximum sum is seen to be shifted to the neighbor- 
hood of 9. As the limits of the groups are changed, 
we find that 9 is the size of item whose frequency is 
constantly contained in the maTimum group. This 
is true of no other size of item, hence the mode is located 
as approximately 9. 

If the class interval used in a frequency table is large, 
it is often desirable to locate the mode within the limits 
of the class. We see that, in the rectangular histogram 
shown in Fig. 7, the class 6-7 is much larger than the 
class 8-9, hence the probabiUties are that the exact 
location of the mode will be nearer 7 than 8. An 
empirical method which has proved serviceable is to 
locate the mode in the modal class according to the 
weights of the classes adjacent to it. This may be 
expressed by the following formula: 

Let I = the lower limit of the class. 

c = the class interval. 

/i = the number of items in the next lower class. 

/2 = the number of items in next higher class. 

Z = mode. 
Then 

If we substitute in the formula the data from the 
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frequency table of cornstalk heights given in Sec. 61, 
we obtain the following: 

2 = 7+^2x1 



32 + 60 



' ^92 
= 7.347+ the required mode. 

In exceptional cases in which the frequencies are very 
irregular, it may be advisable to use two or more classes 
on each side of the modal class as weights, but this 
should not be done if the graph is Ukely to be really 
multimodal. 

The mode may also be located very roughly on an 
ogive by finding the point on the curve at which it is 
most nearly vertical. This may be accomplished me- 
chanically be slipping a ruler along the ogive, keeping 
it constantly tangent to the curve and noting the mo- 
ment when the angular movement of the ruler tends 
to reverse its direction. The ogive, however, is usually 
almost valueless in determining the mode. 

Sec. 69. Advantages of the Mode as a Type. 

1. The mode is useful in cases in which it is desirable 
to eliminate extreme variations. A few trees in the 
forest or a thousand-dollar check in the church collection 
would in no way disturb the mode but would affect 
the arithmetic average. 

2. In determining the mode, it is unnecessary to 
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know anjrthing about the extreme items except that 
they are few in number. We need have no record 
of the number of miUionaires and the size of their estates 
or the number of paupers in order to find the modal 
wealth of the people of the United States. 

3. It may be determined with considerable accuracy 
from well-selected sample data. 

4. It is the type that, to the ordinary mind, seems 
best to represent the group. It is more intelligible 
to say that the modal wage of workingmen in a com- 
munity is $2 per day than to say that the average wage 
is $2.17 when not a single man actually receives the 
latter amount. 

Sec. 70. Disadvantages of the Mode as a Type. 

1. In many cases, no single, well-defined type 
actually exists. One could scarcely pictiu*e a modal 
size of city which would mean anything. In wage 
statistics, we are likely to find several distinct modes 
corresponding to the various grades of labor. 

2. The mode is not at all useful if it is desirable to 
give any weight to extreme variations. If one wished to 
learn how much wealth each person would have were 
all goods equally distributed, he would not be assisted 
by a knowledge of the present modal wealth. 

3. It cannot be located by any simple arithmetic 
process and, in many cases, is difficult to determine 
accurately by any method. 

4. The product of the mode by the number of items 
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does not give the correct total, as is the case when 
the arithmetic average is used. 

5. The mode may be determined by a comparatively 
small ntunber of items of wiif orm size in a large group 
of varying size. Thus, in a community having great 
variations in wealth, the modal value of possessions 
might be $992 simply because three people were listed 
at that amount while the wealth of all others varied. 
This difficulty is overcome in practice by using classes 
of considerable breadth. 

II. THE MEDIAN. 

Sec. 71. Defining and Locating the Median. 

If a number of similar objects are placed side by 
side in order of their size, they are said to be arrayed. 
We have, in Fig. 5, lines representing an array of the 
lengths of 113 leaves. If any group of objects is thus 
arrayed, the middle one is known as the median item. 
Thus, in the leaf lengths illustrated, the fifty-seventh 
item would be the median, for it would have fifty-six 
items on each side of it. The median leaf-length, 
therefore, is about 6.7 cm. If there is an even number 
of items the median item does noi actually exist, but 
it is assumed to be located between the two middle 
items. Were, we to experiment with a much larger 
number of leaves, but chosen, as were the 113 in the 
illustration, purely at random, we should find that 
the length of the median leaf would, like that of the 
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modal leaf, remain practically constant. The median 

therefore proves a useful type to represent a given set 

of items. 

The median may also be defined as that item whose 

size corresponds most closely to the size of all the other 

items in the array. Stated in mathematical language 

this means that, when all deviations are considered 

positive, the sum of the deviations from the median 

is a minimum. 

Deviations from Median. 

A BCDM'tF G 

Fig. 11. 

Demonstration (see Fig. 11): 

Let ABf AC, AD, AM, AE, AF, and AG represent an array 
of seven items of which AM is evidently the median. 

Then, the sum of the deviations from M is equal to BM 
-\-CM -\- DM + ME H- MF -h MG. Now let us take any 
other item as AE. The sum of the deviations from AE is equal 
to 

{BM-\-ME)-\-{CM+ME)'\-{J)M-^ME) +ME+iMF-ME) 

+(MG-'ME) ^BM+CM+DM'^MF+MG+2ME. 

But this sum is greater by ME than the deviations from the 
median, hence the sum of the deviations (taken positively) from 
the median is a minimum. 

When the items have been arrayed as in Fig. 5 the 
median is located simply by counting until the middle 
item is reached. In practice, however, we usually have 
the items given in a frequencj'^ table with a large num- 
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ber of items in each class. A simple way of finding the^ 
median in such cases is to first plot the data as an ogiXne 
and smooth. A horizontal line is now drawn through 
the midpoint of its altitude or its projection on the ver- 
tical axis. At the intersection of this horizontal line with 
the ogive, a vertical line is dropped to the horizontal axis 
and the point of intersection indicates on the scale the 
median required. In Fig. 10, there are 217 items, there- 
fore, the horizontal line is drawn through the point 
representing 108.5 on the vertical axis and the median 
height is found to be 7.15 ft. 

It is often desirable to definitely locate the median 
within a class by using the frequency table direct. This 
may be done by interpolation, the assumption being that 
the size of items varies uniformly throughout the class. 
If this assumption is true, the following formula will 
hold good. 

Let M = the median. 

c = the class interval of the class containing the 

median. 
I = the lower limit of the class- 
/ = the number of items in the class. 
i = the number of items up from the lower lijalit 

of the class ^ which the median item 

occurs. 



Then 



M = Z + ^-il 

4 



10 
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In the frequency table given in Sec. 63, there are 217 items. 
The median item, then, is number 109 in the array. But the 
109th item is the 17th item up in the array of 85 items in the 7-8 
ft. dass .Therefore 1^7, c = 1, / = 55, and i = 17. Sub- 
stituting in the formulaj 

1(2 X 17 - 1) 



M--7 + 



2X85 



M s 7.19 ft., approximately the same result obtained by the 
graphic method. 

Sec. 72. Advantages and Disadvantages of the Median 
as a Type. 

The advantages may be enumerated as follows. 

1. It may usually be located with greater exactitude 
than the mode. This is especially true in groups in 
which the mode is ill-defined. 

2. It is but slightly affected by items having extreme 
deviations from the normal. In this respect, it re- 
sembles the mode more closely than it does the arith- 
metic average. The thousand-dollar check in the 
church collection does not affect the mode at all and 
it affects the median only as much as any other single 
item larger than the median would do, that is, the 
weight of this deviation is not increased by its extra- 
ordinary size, but the item receives the same weight 
as any other instance and no more. 

3. Its location can never depejid upon a small number 
of items, as is sometimes the case with the mode. 

4. If the ntmiber of the extreme items is known, 
their size is not required in determining the median. 
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Thus, if the number of persons possessing over $100,000, 
and the number of paupers are known, the median of 
wealth could be calculated from statistics of the pos- 
sessions of the intervening classes without considering 
the value of the property of either the extremely poor 
or the extremely rich. 

5. The median is especially useful for considering 
data, the items of which are not susceptible of measure- 
ment in definite units. It is impossible to measure 
in specific units the mental characteristics of a child 
but it is perfectly possible to array a group of children 
according to their respective mentality. An arith- 
metic average, in such cases, is meaningless and prac- 
tically useless for comparative purposes but a median 
can be legitimately determined and its characteristics 
compared with other similar medians. 

Some of the disadvantages of the median are: 

1. Like the mode, it is not so readily determined by a 
pimple mathematical process as is the arithmetic aver- 
age. 

2. As in the case of the mode, a correct total cannot 
be obtained by multiplying the median by the ntunber 
of items. 

3. Like the mode, it is not useful in those cases in 
which it is desirable to give large weight to extreme 
variations. 

4. Unlike the mode, but like 'the arithmetic average, 
it is frequently located at a point in the array at which 
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actual items are few. Thus, the median wage might 
accidentally fall on the $2,373^ per day while perhaps 
only a very few men actually received this amoimt. 

5. In a discrete series in which the items are so 
slightly dispersed that they fall largely in the modal 
class, there may be so many items of the same size 
as the median that it becomes very indefinite. In 
such a casC; the niunber of items larger than the median 
may be very different from the number of items smaller 
than the median. When this difference is too great, 
the value of the median as an average is largely de- 
stroyed. 

On the whole, the median is one of the most valuable 
types for practical use, and for studies such as wages, 
distribution of wealth, etc., is often decidedly superior 
to either the mode or the arithmetic average. 

III. THE ARITHMETIC AVERAGE OR MEAN. 
A. The Simple Arithmetic Average. 

Sec. 73. Definition of the Arithmetic Average. 

The sum of all the items in a group is known as the 
aggregate. The arithmetic average may be defined 
as the stmi or aggregate of a series of items divided 
by their ntmiber. 

In computing the arithmetic average from a fre- 
quency table, the student must, of course, remember 
to multiply each item by its frequency before sum- 
mating. When the size of items is only approximately 
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known, the midpoint of the class may usually be taken to 
represent each of the items therein without introducing 
any serious error. This is especially true when the 
class interval is small. One of the characteristics of 
the arithmetic average which is derived from the defi- 
nition given above is that the sum of the deviations 
(signs considered) of all the items therefrom equals 
zero. This may be proved as follows: 

Demonstration, 
In Fig. 12, let KMi, KM2, and KMz be part of a series 

Deviations from Arithmetic Average. 

,- Ml .Mo A M* 

K' I I ' I I * 

Fig. 12. 

of items whose arithmetic average is KA, Then, by 

definition, 

KMi + KM2 + KMz' ' ' KMn ^ ^^ 

n 

By substitution, 

{KA-AMi) + {KA-AM2) + (KA+AM,) 

'- +{KA + AMn) ^ j^^ 

n 
.'. KA-AMi^-KA-AMi-^KA+AMv + KA _ 

+ AMn = nKA 
and 

nKA-AMi'-AM2+AMz+'' -AMn^nKA. 
Therefore, 

-AMi - AMi+AM,+ • . • AMn = 0. 
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Hence, the sum of the deviations from the arithmetic 
average (signs considered) equals zero. 

Sec. 74. Determination of the Arithmetic Average by 

the Short-cut Method. 

If a group of large items are all of nearly the same 
size, it is frequently a time-saver to find the arithmetic 
average by the appUcation of the following rule: 
Asstmie any number as the average; find the sum of 
the deviations therefrom, having regard for the signs 
in each case; divide the stmi of the deviations by the 
total number of items; then, add the quotient to the 
assumed average. The result is the true average. 
To illustrate: 

TABLE X. 

Shobt-cut Method of Comptjtino the Aiuthmetic Avebaoe. 



Items. 


Assumed Ayerage. 


Deriations from As- 
sumed Arerage. 


747 
742 
735 
738 
730 
736 


740 
740 
740 
740 
740 
740 


+ 7 
+ 2 

- 5 

- 2 
-10 

- 4 






Tot^, -12 



The number of items is 6. 

- 12 ■^ 6 = - 2, 
740 + (-2) = 738 = the true average. 

In using the short-cut method in a frequency table, 
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it is, of course, necessary to multiply each deviation 
by the number of items in the group represented. 

Algebraic Proof of the Short-cut Method. 

In Fig. 13, let KMi, KM2, KMz, and KM^ be four 
instances of a series of n items and KA be the arith- 
metic average of the whole. Assume that the quantity 

Short-cut Method fob Arithmetic Average. 



J "^ 



> V ^ y 

di d^ 

Fig. 13. 

KQ is the arithmetic average. Let di, cfe, ds, • • • d« 
be the respective deviations of the items from the 
true arithmetic average KA. Let QA = x. 
To prove: 

2 deviations ivomKQ . ^^ ^ . 
rAy = AA. 

n 

Proof: 

2) deviations fromJTQ 
n 

^ -(di-a:)-(d2-a:) + (d8+g) + (d4+g)" ■+(dn+x) 

n 

— di — ^2 + ^3 + ^4 + • • • dn + ^w; nx 

= = — =x, 

n n 

since the sum of the deviations from the arithmetic 
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average = 0. But x+KQ = KA, 

. S deviations from KQ , ^^^ ^ . 
n 

Q.E.D. 

Sec. 75. Advantages of the Arithmetic Average as 
a Type. 

1. Unlike the median or mode, it may be definitely 
located by a simple process of addition and division, 
and it is mmecessary to draw diagrams or arrange the 
data in any set form or series. 

2. It gives weight to extreme deviations which is 
desirable in certain cases. 

3. Unlike the mode, it is affected by every item in 
the group, and its location can never be due to. a small 
class of items. 

4. It is familiar to everyone and hence needs no 
explanation when used. The same cannot be, said of 
the median or mode. 

5. It may be determined when the aggregate and 
the number of items are known and information con- 
cerning the various items is entirely lacking. If we 
know the amount of sugar manufactured and imported 
into the United States annually and the population 
of the United States, we may calculate the average 
consumption of sugar per capita and never know how 
much any single consumer uses. This would be im- 
possible with any other average. 
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Sec. 76. Disadvantages of the Arithmetic Average 
as a Type. 

1. It cannot be located on a frequency graph when 
such is abready at hand. 

2. It cannot be accurately determined where the 
extremes of a series are missing. In this respect, it 
is surpassed as a type by both the median and the mode. 

3. It emphasizes the extreme variations which in 
most cases is tmdesirable. 

4. It cannot like the median be used with advantage 
in the study of incommensurable quantities. 

5. It is likely to fall where no data actually exist. 
It is easy to find by computation that the average 
number of persons in a family is 5.41, although such a 
number is evidently impossible. 

B. The Weighted Akithmetic Average. 

Sec. 77. Definition of the Weighted Average. 

By a weighted average, we mean one whose constitu- 
ent items have been multiplied by certain weights before 
being added, the sum thus obtained being divided by 
the sum of the weights instead of by the number of 
items. The weights used may represent the actual 
or estimated number of items existing in a certain 
group, in which case it does not differ essentially from 
a simple average. If, for example, we know the wages 
paid to a few men in each occupation in an industry 
and we desire to ascertain the average wage for that 
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industry, we must multiply the average wage foimd for 
each occupation by the number of men engaged in 
that occupation, simmiate the results, and divide the 
sum by the total nimiber of men employed in the in- 
dustry. If we took simply an arithmetic average of 
the samples, it would evidently be inaccurate unless 
the numbers of samples in different occupations were 
in the same ratio as the total number of men engaged 
in the respective occupations. The result obtained 
by means of the weighted average is, in this case, ap- 
proximately the same as if we had the wage of each 
man in the industry recorded and then obtained a simple 
arithmetic average of the entire data. 

In other instances, the weights do not represent 
numbers but stand for estimates of relative importance. 
In making up a semester average of the grades received 
by a student, the teacher usually assigns an arbitrary 
weight to the different factors as, for example, 3 to the 
class grade, 2 to written work, and 4 to the final ex- 
amination, the sum of the products being divided by 9. 
In this case, the weighted average corresponds less 
closely to any simple average. 

Sec. 78. Effects of Weighting. 

If the number of weights used is small, the size of 
weights chosen is likely to have a marked effect on 
the average. When the weights are very numerous, 
the chances are that they will tend to offset each other, 
so that the results will be but little different from those;* 
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obtained by using a simple average. This depends 
largely, however, on whether there is any relationship 
between the size of the weight and the size of the item. 
If large weights go with large items, or vice versa, the 
average will be seriously affected whenever weights are 
neglected or erroneous ones used. One usually finds 
in the study of wages that low wage men are numerous 
and high wage men few. If, 'then, one used a simple 
average of the wages in all the occupations of an in- 
dustry, the single superintendent would be given as 
much weight as the thousand common laborers under 
his direction and his large salary would make the av- 
erage wage appear much too large. In such cases, 
therefore, weights cannot be neglected. It may, how- 
ever, be mathematically demonstrated^ that an error 
in weights tends to be much less serious in its effects 
on the final result than an error in the size of the original 
items. 

An error in the size of the original items cannot be 
remedied by adjustments in the weights used. Hence, 
the following general rule may be enunciated: 

The items should be as exact as possible and the 
weights used should be approximately accurate but 
great exactness in the size of weights causes much 
extra work and is unnecessary. 

^ For proof see Bowley's Elements of StatisticSf pp. 203-205. 
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IV. THE GEOMETRIC AVERAGE. 

Sec. 79. Definition of the Geometric Average. 

The geometric average is obtained by multiplying 
together the n items in a series and then extracting 
the nth root of the product. It is almost necessarily 
computed by the use of logarithms. It was largely 
used by Jevons in his study of prices, but has not found 
much favor among statisticians in general. 

Sec. 80. Characteristics of the Geometric Average. 

The geometric average is always slightly smaller 
than the arithmetic average. It gives comparatively 
littie weight to extreme variations. In this respect, 
it lies between the arithmetic average and the median. 
It requires more time to compute than other averages. 
It has the disadvantage of not being commonly tmder- 
stood and being somewhat difficult of comprehension 
to the non-mathematical mind. 
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CHAPTER XIII. 
DISPERSION. 

Sec. 81. Explanation of Dispersion. 

The term dispersion is used to indicate the fact 
that, within a given group, the items differ from one 
another in size, or, in other words, that there is a lack 
of uniformity in their magnitudes. When we say 
that the dispersion is sUght, we mean that this difference 
is trivial when compared with the absolute size of the 
average item while the dispersion is said to be great 
when such variation is relatively large. If, for ex- 
ample, a military company were composed entirely of 
men ranging in height from 68 to 70 inches, we would 
say that their height was very uniform or that the dis- 
persion was slight. If, however, the shortest men were 
only 62 inches and the tallest were 74 inches in stature 
we should then say that there was considerable dis- 
persion in the heighths of the men. To cite another 
instance: In the early days of the frontier, wealth was 
quite evenly distributed but, today, with our million- 
aires and paupers, we have a wide dispersionof wealth. 
In Array I of Fig. 14, we find a total dispersion in length 
of five-sixteenths of an inch, while in Array II the dis- 
persion is ten-sixteenths of an inch. 

The dispersion of a group may be measured by the 

difference in size or characteristics of the most extreme 

141 
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AmiATS or Leaf Lengths Illustbating Dispersion. 
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items, in other words, the range, or it may be measured 
by the general deviation of the items from the type. 
The range is too indefinite to be used as a practical 
measure of dispersion. If, in a community, the shortest 
adult were 5 ft. and the tallest 6 ft. 1 in. in height, the 
range would evidently be 13 inches but, if a dwarf 
whose height was but 3 ft. 6 in. should move into the 
neighborhood, the range would suddenly be increased 
to 31 inches, while the average height of the people 
would be but trivially affected. It is evident that a 
measure so radically affected by stray items at the 
extremes must be practically valueless. We must, 
therefore, measure dispersion by the deviation from 
some type or average or at least modify the range in 

■X. 

such a way as to eliminate the scattering extreme 
items. 

Dispersion may also be measured absolutely or 
relatively. In the first case, the average size of the 
items makes no difference, while, in the second case, 
this is of fundamental importance. A difference of 
an inch in the heights of a company of men would be 
very slight but a difference of an inch in the lengths of 
their noses would be decidedly noticeable. In Fig. 14, 
the absolute range of dispersion in Arrays II and III 
is the same in each case, but, relatively, it is more than 
twice as great in Array II as in Array III for the average 
size of an item in Array III is more than double that 
of one in Array II, 
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To render the relative dispersion in diif erent groups 
comparable, it is necessary to obtain a coefficient of 
dispersion for each group. This coefficient is computed 
by dividing the absolute measure of dispersion used 
by some quantity representing the tsrpical sized item. 
The coefficient, then, represents the fraction of variation 
occurring generally in the given group of data. 

Sec. 82. Moments. 

Dispersion is commonly measured by finding the 
average deviation of the items from some one of the 
types, usually the arithmetic average, the mode or the 
median. For measuring such deviations, the various 
moments are used. The first moment is simply the 
average deviation, or, in other words, the sum of the 
deviations divided by the number of items. If mi, 
Wj, ms, • • -mn are the items, n the number of items, and 
di, dj, ds, ' "dn the respective deviations of the items 
from the type, then the moments are expressed as 
follows: 

First moment: Sd/n. 

Second moment: XcP/n. 

Third moment: X<fi/n, 

I. MEASURES AND COEFFICIENTS OF DISPERSION. 

A. First Group. Based on First Moment. 

Sec. 83. The Average Deviation and the Corre- 
sponding Coefficient of Dispersion. 

In computing the average deviation, all deviations 
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are considered positive. They may be computed 
either from the mode, the median or the arithmetic 
average/ 

If mi, mzj ms, • • • mn are the ite^ms, n in number, 
having respective deviations of di, ^2, ds, • • • dn from 
the given type, the average deviation being represented 
by 6, and if a is the arithmetic average, M the median, 
and Z the mode, the average deviation may be com- 
puted by either of the following formulae according 
to the average used. 

The average deviation from the arithmetic average, 

8 = — ^^ or — . 

n n 

The average deviation from the median, 

X(m-M) XdM 

dji = — ^^ or . 

n n 

The average deviation from the mode, 

. Hjm-Z) Sd. 
Ot = or 



n n 

These measures of dispersion may be reduced to 
coefficients by dividing each by .the respective average 
employed. The coefficient of dispersion based on the 
arithmetic average 

S(m — a) Sd 5 

= or — or - . 

na na a 

11 
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That based on the median 

nM nM M * 

That based on the mode 

The following example illustrates the mode of com- 
puting from a frequency table this coefficient of dis- 
persion, the deviations from the median being used as 

a basis. 

TABLE XI. 

Computation op the Average Deviation. 



r 

Size of Item, 
m 


Frequency. 


Deviation from 

Median. 

da 


SdM 


1 

4 
5 
6 
7 
8 
9 
10 
11 


2 
3 
5 
8 
6 
4 
2 
1 


3 
2 

1 

1 
2 
3 
4 


6 
6 
5 

6 
8 
6 
4 




n=31 




Ydu =41 



The median = M = 7. 

41 



«, = 



31 



= 1.32 + 



The coefficient of dispersion 



8m_ 

M 



1.32 



= 0.19 - 
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In the above example, the median has been considered 
a whole nimiber which is correct only if the series is 
discrete. If the series is continuous, it is necessary 
to interpolate in the fourth class to locate it exactly. 
It must also be noted that all deviations are treated as 
positive. 

The characteristics of this coefficient of dispersion are: 

1. It is easy to compute and comprehend. 

2. It takes every item into consideration. 

3. It gives weight to deviations according to their 
size, extreme deviations having more weight than small 
ones, but not being disproportionately magnified. 

This coefficient is a good one to use in many economic 
studies as, for example, in calculating the personal 
distribution of wealth in a community or a nation, 
since the very rich and the very poor are both taken 
into account. The question as to which average should 
be used in the computation is not usually of great 
importance. For the distribution of wealth, it is 
probably preferable to use the deviations from the 
median. 

B. Second Group. Based on the Second 

Moment. 

Sec. 84. The Standard Deviation and Coefficient. 

The only measure of dispersion in this group in 
extensive use at present is the standard deviation. It is 
conceivable that a similar measure might be used whose 
deviations were based upon the mode or median, but 
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the standard deviation is invariably computed from 
the arithmetic average. The formula is as follows: 

<r = standard deviation. 
The other letters are used as in Sec. 83. Then 



w 



Second Moment 



-v?-v 



S(m - a)2 



n 



In calculating the standard deviation by use of an 

ordinary frequency table, the following illustrates the 

direct method. 

TABLE XII. 

Calcitlatign of Standabd Deviation from a Frequency 

Table: Direct Method. 



Size of 












Items in 
Mm8. 


Freqaencj. 


n\f 


Deyiation. 
d 


<9 


/» 


m 












8 


2 


16 


-3 


9 


18 


9 


4 


36 


-2 


4 


16 


10 


6 


60 


-1 


1 


6 


11 - 


9: 


99 











12 


6 


72 


+1 


1 


6 


13 


4 


52 


+2 


4 


16 


14 


2 


28, 


+3 


9 


18 


1 

1 

1 


n=33 


Xm= 363 






Sd»=80 



• - A^ - ^f - ^^^ - '■'« - 

Having obtained the standard deviation, all that is 
necessary to derive the corresponding coefficient of 
dispersion is to divide by the arithmetic average. 



:> 
/ 



DISPERSION. 149 

Therefore, the coefficient of dispersion 

<r 1.56 



a 11 



= 0.14+. 



Sec. 85. The Short-cut Method for Computing the 

Standard Deviation. 

The preceding method of computing the standard 
deviation is the simplest if the arithmetic average 
chances to be an even nimiber. When, however, it is 
fractional, the effort involved in squaring and multi- 
plying the decimals is considerable and it is preferable 
to use the short-cut method instead. The rule there- 
for is as follows: Select some whole number approxi- 
mating the arithmetic average; compute the devi- 
ations therefrom; square each; summate; subtract 
therefrom n times the square of the difference between 
this number and the true average ; divide by n ; extract 
the square root of the quotient. 

The algebraic formula employed in this method is: 

If a; = the assumed average 
and a = the true average. 

Then 



-4 



i(m - xY - n(a - xY 



n 



To illustrate the method by example. 
Assumed average = x = 9. 
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TABLE XIII. 

CJOMPUTATION OF THE STANDARD DEVIATION BT THE ShORT-CUT 

Method. 



Size of 

Item. 

m 


Freqaenoy. 


nkf 


m— « or 


(m— x)" or 


/d.' 


1 1 

6 
7 
8 

9/- 
10 
11 
12 


2 
4 
5 
7 
4 
3 
1 


12 
28 
40 
63 
40 
33 
12 


-3 

-2 

-1 



+1 
+2 
+3 


9 
4 
1 


1 
4 
9 


18 
16 
5 

4 
12 
9 




n»26 


2m =228 




2di=64 



a = ^=8.77-, 

a - X = 0.23 +, 

(a - x)« = 0.053, 

n(a - xY = 1.375 +. 



-J 



Sdx^ - n(a - x)« 



n 



>^ 



= J^*^^=V2l^ 



= 1.55. 



The standard coefficient of dispersion then equals ff/a. 
But 

V 1.55 



a 8.77 



= 0.177. 



The correctness of the short cut method is based upon the 
following proposition: The sum of the squares of the deviations 
from the arithmetic average is a minimum. This theorem is 
demonstrated thus: 
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Given: 

To prove: 
Proof: 



But 



But 



a — the true arithmetic average. 
X = any other assumed number 

2(w - xy > S(m - o)«. 

(m — xy = m* — 2xm + x*, 
2(m - a;)* = Sm* - 2xSm + nx*. 

2m = the aggregate = na, 
.', S(w — xy = Sw« — 2xna + nx* 
= Sw* + w(x« - 2aa;) 
= Sm« + n(x* - 2ax + a«) 
= Sw* — na* + n(x — a)*. 

2(m — ay = Sm« - 2aZm + na* 
= Sm* — 2a • na 4- na^ 
= Sm« — na*. 
.'. 2(m - a;)« > 2(m - a)«. 

From the above, it follows that 

2(w — xy — n(x — ay = 2(m — a)*. 



— na* 



Q.E.D. 



But 



. l 2(m - a;)' - n (x - a)* _ / 2(m - a)« 
••\ n . " \ n • 



^ / 2(m~o)« ^ 
> n 



.'. <^ = V 



S(m - x)* — n(x — ay 



n 



But this is the formula for the short-cut method for the standard 
deviation and it is thus proved correct. 

Sec. 86. Characteristics and Uses of the Standard 
Deviation and Coefficient. 

The standard deviation has, in the past, been used 
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more by biologists than by economists. The squaring 
of the large deviations gives more weight to extreme 
instances than to those dififering but sUghtly from the 
mean and, for some purposes, this property is valuable. 
In most economic studies, however, the reverse tends 
to hold true and the average deviation is therefore 
preferable. An important exception to this rule, 
however, is found in the use of this coefficient in the 
computation of Karl Pearson's coefficient of correlation, 
a subject which will be discussed in a later chapter. 
The squaring of the deviations eliminates the negative 
signs and hence facilitates the mathematical manipu- 
lation of the figures. This is a valuable property of 
the standard deviation when it is used for advanced 
work in the study of symmetrical frequency distri- 
butions and, largely for this reason, it has proved a 
favorite with biologists. On the other hand, it requires 
considerably more effort to compute "the standard 
deviation than the average deviation and, partially 
for this reason, the latter is commonly used by econ- 
omists unless there is some special reason for pre- 
ferring the former. 

Another measure of dispersion based, like the stand- 
ard deviation, on the second moment is the modulus, 
commonly represented by c. The formula for it is: 

\ n ^^ \ n ' 

It has little place in the field of elementary statistics 
and so will not be discussed in this book. 
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C. Third Group — Based on Quartiles. 

Sec. 87. Quartiles, Deciles, etc. 

The median was defined as the middle item of the 
array. Similarly the quartiles are those items that 
divide the nimiber of items in an array into fourths, 
the deciles those that divide it into tenths, the percen- 
tiles into hundredths, etc. The second quartile, the 
fifth decile and the median are evidently synonymous. 

The median is the — ^ — item; the first quartile is the 
— J — item; the third quartile iiS the — - item; the 
first decile is the — ttj" item; the seventh decile is the 

— - item; the twenty-fourth percentile is the 

24 (n + 1) 

— ^-yrjT — ^ item; etc. In Fig. 5 we found the median 

to be the fifty-seventh item in the group. Similarly, 
the first quartile would be the twenty-ninth item, and 
the third quartile would be the eighty-fifth item. The 
quartiles, deciles, etc., are usually located by means of an 
ogive, its altitude being divided into fourths, tenths, etc.^ 
as the case may require. They may also be located in an 
array by simple division or in a frequency table by 
division and interpolation within a group, following 
the same formula used for determining the median. 
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Sec. 88. The Quartile Measure and CoeflScient of 
Dispersion. 

All the measures of dispersion previously discussed 
have taken into account the deviation of each particular 
item. The measure which we are now about to con- 
sider gives us a general idea of the dispersion of an 
array without going into so much detail. Half of the 
items in each array are included between the first 
and third quartiles. If the dispersion of this half of 
the items is fairly representative of the whole, we have, 
here, a very simple method of measuring it. In Fig. 5 
the first quartile is about 6.3 cm. while the third ap- 
proximates 7.1 cm., giving a range of fluctuation of 0.9 
cm. The dispersion, however, if measured from the 
midpoint between the quartiles would be but half that 
amount or about 0.45 cm. In Arrays I and II in Fig. 14, 
we see the effect of a change in the amount of dispersion 
in a group on the distance between the quartiles. 
When the distance between the extreme variates is 
doubled the distance between the quartiles is approxi- 
mately doubled also. The quartile deviation which 
is probably the simplest way of approximating the dis- 
persion of an array has the following formula. 
If 

Qi = the first quartile, 



and 
Then 



Qz = the third quartile, 
the quartile deviation = ^^ ~ ^^ . 
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The quartile deviations in Arrays II and III in 
Fig. 14 £tre equal but, since the average size of item in 
Array III is more than twice as large as in Array II, 
it is necessary to divide each by a quantity representing 
its typical size. This requirement would seem to be 
fairly fulfilled by the average of the quartile lengths or 

Qz + Qi 
2 • 

The Quartile coefficient of dispersion, then, would be 

2 Qs - Qi 



Qz + Qi Qz + Qi 
2 

The quartile deviation and its coefficient have to their 
credit the merit of simplicity and ease of computation 
and are highly satisfactory if one is dealing only with 
the main body of an array and cares nothing about 
extreme variations. A glance at Fig. 5 will show that 
the length of all leaves shorter than the first quartile 
or longer than the third would have no effect whatever 
on the quartile deviation or coefficient. Yet, half 
the leaves fell outside these limits and some of the 
variations might have been very marked indeed. The 
quartile deviation is, therefore, useless when it is 
desired to give weight to the extremes in which respect 
it is exactly the opposite of the standard deviation; 
the average deviation occupying the intermediate and, 
for' general purposes, superior position. 
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Sec. 89. The Lorenz Curve. 

We have seen that graphic methods are very useful 
in illustrating frequency distribution by means of both 
absolute and percentage histograms. Another curve, 
worked out by Dr. Lorenz, illustrates very nicely the 

Lorenz Graph, Showing Distribution of Wbai/fh. 




100 90 



80 



70 60 50 40 30 

per cent of population 
Fig. 15. 



20 10 



dispersion of a group and, in the study of the distri- 
bution of wealth, has proved especially applicable. 
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It does not, like a coefficient of dispersion, furnish a 
numerical measurement of the distribution of wealth 
and, in this respect, is inferior, but its particular merit 
lies in the fact that it pictures the distribution among the 
various sections of the population and does not merely 
give an average for the whole group. Fig. 15 outlines 
the mode of constructing this graph. If wealth were 
equally divided among the people, we should evidently 
have a straight line, like MN, connecting the extremes 
of the scales. In practice, we get curves hke a or 6. 
The closer the curve approaches the line of equal 
distribution the greater is the homogeneity of wealth 
indicated; the further it bows away from that line, the 
larger the percentage of the population in poverty and 
the greater the concentration in the hands of a few 
multimillionaires. 

In the comparative study of different times or periods, 
we usually find that the curves tend to coincide near 
the extremities. When this occurs, it is well to plot 
a little of the extreme parts of the curves on separate 
sheets. In plotting the upper extremity, the horizontal 
scale may be greatly magnified, and when studying 
the right-hand extremity, the vertical scale may be 
correspondingly increased. In this way, the different 
curves are separated so that the variations at the ex- 
tremes may be successfully analyzed. 

This form of graph is also applicable to studies of 
f the distribution among the population of land, wages, 



f 
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income, etc. On the whole, it is more serviceable for 
these purposes than the percentage histogram and 
forms a valuable supplement to the coefficient of dis- 
persion. 
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CHAPTER XIV. 

SKEWNESS. 

Sec. 90. Explanation of Skewness. 

By the term skewness as applied to frequency dis- 
tributions we denote the opposite of S3rmmetry indicat- 
ing that the dispersion of the items within a given group 
Ib not symmetrical or, in other words, that, at points 
of equal deviation above and below the mode, the fre- 
quencies are imequal. Suppose, for illustration, that 
the wheat yields of all the farms of a certain coimty are 
tabulated and a frequency table constructed from the 
data thus collected. If the soil of the whole county is 
comparatively uniform and the modal yield of wheat is 
fifteen bushels per acre, we are likely to find the class- 
frequency in any pair of classes, on opposite sides of 
the mode and equidistant therefrom, to be neariy iden- 
tical for, as we depart from the mode, the numbers of 
farms in the classes in which the yield is more than 
fifteen bushels will probably fall off in approximately 
the same ratio as the numbers in the classes producing 
less than fifteen bushels, thus giving us a normally 
symmetrical frequency distribution like that already 
studied in respect to the dice throws described in Sec. 60. 
If, on the other hand, there exists within the county in 
question a limited area of extremely sterile soil which 

is, nevertheless, utilized for wheat culture we should 

169 
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find the production of this class of farms removed 
from the modal crop by an interval much greater than 
that separating the class containing the most fertile 
farms from the mode. In this case, the dispersion 
would no longer be symmetrical and the distribution 
would be said to be skewed toward the lower side. 

The meaning of skewness is most eadly made intel- 
ligible by the construction of a histogram. If skewness 
is present, the graph no longer presents the nonnal, 
symmetrical, bell-shaped form but the base is drawn 
out to a greater extent on one side than on the other 
as illustrated by graph B in Fig. 16, the lower part of 

HiBToaRAMS Illcstratino Skewnbbb. 



the histogram beii^ here skewed far to the right from 
its normal position at A. 
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In the analysis of social phenomena, a perfectly 
symmetrical histogram is the exception and a large 
degree of skewness is to be frequently expected. Prac- 
tical applications of the measurements of skewness 
have thus far, however, been confined largely to the 
biologic field but, in certain cases, such measurements 
are also useful in the study of economic statistics. 

Sec. 91. The Effect of Skewness on the Sequence of 
Averages. 

In the symmetrical histogram A, shown in Fig. 16, 
the arithmetic average, mode and median are all 
coincident at Z. In the curve, B, they have separated. 
The large items, far out to the right, while comparatively 
few in number, have had considerable effect in pulhng 
the arithmetic average over in that direction for it 
is always located at the center of gravity of the histo- 
gram and, Uke weights himg far out on the long arm 
of a lever, these extreme instances prove more powerful 
than their mere numbers would indicate. 

The median, which bisects the area of the histograms^ 
is likewise shifted to the right by the accession of new 
instances on that side, but the size of these instances 
in this case gives them no added weight and the median, 
therefore, moves a lesser distance than does the arith- 
metic average. In curves not diverging too widely 
from the symmetrical form, the median usually travels 
over two thirds of the space covered by the arithmetic 
12 



162 ELEMENTS OF STATISTICAL METHOD. 

average. Therefore, approximately, 

M =^ Z + %{a- Z). 

The mode, being in no wise affected by the addition 
of the new items, remains at its original location. We 
have, therefore, in a skew curve, a normal sequence of 
mode, median, and arithmetic average, the la^ bemg 
carried furthest in the direction in which the curve is 
skewed. 

Sec. 92. Measures and Coefficients of Skewness. 

If we desire to compare the skewness of one curve 
with that of another, it is necessary to reduce it, in 
every instance, to some numerical quantity. Measures 
of skewness must be reduced to coefficients for the same 
reason that measures of dispersion were so reduced but, 
in the case of skewness, the average size of item does not 
constitute a suitable divisor, for the question now is 
not how much the curve is skewed in proportion to the 
size of items involved, but how much more the items 
deviate on one side of the average than on the other. 
Hence, the denominator chosen must, invariably, be 
some measurement of the average deviation or dis- 
persion of the items. With these points in mind, we 
shall proceed to consider some of the most commonly 
used measures and coefficients of skewness. 

Sec. 93. First Measure and Coefficient of Skewness. 

The distance that the arithmetic average is pulled 
beyond the mode makes one of the simplest possible 
measures of skewness. 
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If a = the arithmetic average, 
M = the median, 

8m = the average deviation from the median, 
d = the average deviation from the arith. average, 
Z = the mode, 

Sz = the average deviation from the mode, 
j = the coefficient of skewness, 
then the simplest measure of skewness is represented by 
the formula a - Z and the simplest coefficient by 

a — Z , . __ a — Z 

I dz 8z 

It makes little difference whether 8z or 8 is used in the 

denominator, provided that the same one is employed 

in each case. 

While the above is the ideal measure of skewness, 

the mode is often so ill-defined as to make it necessary 

to use as the numerator, instead, the difference between 

the median and the arithmetic average. As was noted 

in Sec. 91, this quantity is usually but one third as 

large as the difference between the mode and mean and 

this fact puts it at a disadvantage when the skewness 

is slight. When the median is employed, the formula 

for the coefficient becomes 

a — M 

Sec. 94. Second Measure and Coefficient of Skewness. 

This measure is based on the fact that, in a skew 

curve, the median no longer lies half way between the 
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quartileSy for the quartile nearest to the extended base 
of the curve is pulled in that direction more than the 
quartile opposite. This occurs because the quartile 
on the skew side is moving toward the region of lesser 
density or, in other words, lower frequency, thus having 
its movement accelerated, while the quartile on the 
opposite side is approaching the zone of maximuTn 
density and, hence, has its movement retarded. The 
median, lying usually near the mode, where the fre- 
quency is high, moves but slowly. The result is a 
gradual divergence in the relative distances of the quar- 
tiles from the median, and the difference in these two 
distances is utilized as a measure of skewness. The 
formula for this second measure of skewness, then, is 
(Q, - ilf) - (ilf - Qi) or Qs + Qi - 2M. 

This is reduced to a coefficient by dividing by the 
quartile deviation. The coefficient thus obtained is: 

■ ^ Qb + Qi - 2M 

This coefficient has the same weakness conmion to the 
quartile coefficient of dispersion — ^it fails to take into 
account the size of extreme variations. In calculating 
by this method the dispersion or skewness for a curve 
showing distribution of wealth in the United States, 
the result would be in no wise affected whether the ten 
thousand richest persons in the United States owned 
$100,000 or $100,000,000 each. In either case, they 
would all be far above the upper quartile and so the 
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location of both quartiles and median would be entirely 
unaffected. The merit of this coefficient lies in the 
fact that it is simple and easy to compute and is suffi- 
ciently accurate for practical purposes in those studies 
in which the extreme instances are not considered of 
fundamental importance. 

Sec. 95. Third Measure and Coefficient of Skewness. 
This coefficient is based upon the third moment and 
depends upon the fact that cubing a quantity does not 
change its sign. It was demonstrated in Sec. 73 that 
the sum of the deviations (signs considered) from the 
arithmetic average equalled zero. In a skew curve, 
however, the siun of the cubes of the deviations from 
the arithmetic average does not equal zero for the process 
of cubing increases the relative importance of the ex- 
treme items and these are most important on the skewed 
side of the curve. The result is that the third moment 
itself furnishes a satisfactory basis for measure of skew- 
ness. It is, in one respect, the exact opposite of the 
quartile measure of skewness, since it emphasizes the 
extremes by which the other measure was not at all 
affected. The formula for this measure of skewness, 
then, is 



* p(r^ - «)' or '/M*. 

\ n \ n 



To reduce this measure to a coefficient, various de- 
nominators may be used. Since it emphasizes the 
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extremes, it is well to use a denominator of the same 
type, such as the standard deviation in which case 
the fonnula would read. 

Let j = the coeflScient of skewness. Then 



s/Sd» 



The average deviation might be substituted instead, 
making the formula 

s/2d» 



This coefficient of skewness seems to be one of the best, 
but requires considerable work for its computation. 
Various other measures and coefficients have been 

-m 

worked out but most of them are too complicated to 
be of practical value in elementary work. 
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CHAPTER XV. 

HISTORICAL STATISTICS. 

Sec. 96. General Characteristics. 

In the preceding chapters, we have been dealing 
largely with data in which time is not a factor. One 
of the most important fields of statistics, however, is 
that which compares phenomena at different dates. 
This may be done in several different ways, among 
which the following are common. 

1. Tables of absolute figures, 

2. Absolute historigrams, 

3. Logarithmic tables, 

1. Logarithmic historigrams, 

5. Index numbers, 

6. Index historigrams. • 

We shall discuss these in order, omitting the first 
one, which scarcely needs explanation. 

Sec. 97. Absolute or Ordinary Historigrams — 
Smoothing — the Moving Average — the Trend. 

The numerical record of the changes of a variable 
during a number of successive intervals of time may be 
denominated a historical series and the graphs obtained 
when this historical series is plotted, using the sizes 
of the variable as ordinates and time intervals as ab- 
scissse, is called a historigram. This must not be con- 

167 



168 ELEMENTS OF STATISTICAL METHOD, 

fused with the histogram into the construction of which 
lime does not enter. 

The accuracy of a historical series, and, therefore, of 
a historigram depends manifestly on the length of time 
intervening between the records. An hourly record 
of temperature is more accurate than a daily, a daily 
than a weekly, etc. 

In constructing a historigram, the original points 
plotted from the data may be connected by straight 
lines, but it is usually preferable to smooth the graph 
into a curve, the rules for smoothing being somewhat 
similar to those applying to histograms. In smoothing 
free hand, one should remember that the maximmn 
j)ossible radius of curvature should be constantly 
maintained, thus avoiding, as far as may be, all sharp 
breaks, unless such breaks are known to have actually 
occurred. In such variables as records of population, 
temperature, etc., changes rarely occur suddenly and 
sharp angles are therefore normally absent. 

One of the best methods of smoothing certain varieties 
of historigrams is to use a moving average to obtain 
a trend. It is only useful in those historigrams which 
manifest more or less periodicity and the obiect of 
using the moving average is to rid the historigram of 
these fluctuations. In determining on the size of 
groups to be used in calculating a moving average, 
one should use a period of time approximately equal 
to the length of the cycle which it is desired to eliminate. 
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The best method of determining upon the proper length 
of cycle to use for the moving average group is to first 
plot the data as a historigram and then observe the 
average time-distance between the consecutive crests 
and between the successive troughs of the waves, this 
giving the approximate wave-length. The data given 
in the table below are plotted in Fig. 17. From this 
historigram, we see that the wave-length runs from six 
to eight days. It is preferable to use an odd number of 
days for the moving-average group, so that the average 
may be plotted opposite the central item of the group. 
In this case, then, we shall choose seven days as the 
appropriate length of period. 

The first step in the computation is to obtain the 
average of the first seven items and place it opposite 
the fourth item. The average of the second to eighth 
items, inclusive, is next found and placed opposite 
the fifth item. This process is continued to the end 
of the series with the results shown in the table below. 
A shorter method, when the groups are large, is to add 
each time to the last total the difference between the 
nimiber added and the nimiber dropped. 

In the table below, for example, the average from 
March 1 to 7 inclusive is 24.0°, from March 2 to 8, 
inclusive, is the same, since the same nimiber 20° is 
both added and subtracted. For March 3 to 9, 28° 

is added and only 25° subtracted, hence the total of 
the group is increased by 3° and the average becomes 
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24.4®. This process is conaderably facilitated by- 
cutting a slot in a piece of cardboard just long enough 
to cover in the table the size of group desired. 

It is impossible to accurately carry out a trend to 
the extremes of the data. The curve may be carried 
out free hand to each end or one may form artificial 
final groups by duplicating the number found at the 

TABLE XIV. 
Table Illustratinq the Determination of the Trend. 



Date. 


Maftn Temperatare 


MoTlng Arentge, 7 Baj 
urottping. 


• 


Fahrenheit in Degrees. 


Mar. 1 


20 




2 


25 




3 


22 




4 


35 


24.0 


5 


26 


24.0 


6 


22 


24.4 


7 


18 


26.1 


8 


20 


26.7 


9 


28 


28.7 


10 


34 


29.9 


11 


39 


31.9 


12 


40 


32.7 


13 


30 


33.6 


14 


32 


34.9 


15 


26 


36.1 


16 


34 


37.1 


17 


43 


38.4 


18 


48 


38.9 


19 


47 


41.1 


20 


39 


43.3 


21 


35 


44.3 


22 


42 




23 


49 




24 


50 


1 
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extreme. Thus, in the above table, 50 might be added 
at the close three successive times, forming the required 
new groups in this fashion. Either of these methods 
is purely an approximation. 

It will also be noticed that, in the moving average 
line shown in Fig. 17, all irregularities have disappeared 

The Moving Average or Trend. 
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and we have obtained the general rising trend of tem- 
perature during the entire period. To study short 
time changes, therefore, the original historigram and 
not the trend must be studied. 

It is impossible to apply the moving average with 
equal success to any and all historigrams. Fig. 18 
shows a curve in which no regular periodicity is mani- 
fested. If a moving average were used in this case, 
the only possibility would be to take a long period of 
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perhaps thirty years and this would give nothing but a 
general trend for the whole time covered without regard 
to any of the large oscillations which it might be de- 
sirable to retain. 

Care should be taken in selecting an appropriate 
vertical scale for all historigrams. If the units occupy 
too much space, small changes in the size of items will 
apparently be important fluctuations while, if the units 



HienoRiGRAM Showing no Dbfinitb Periodicity. 
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occupy too little space, the graph will assimie a false 
appearance of uniformity. This is illustrated by the 
changed appearance of the wheat-price historigram 
in Fig. 21 when its vertical scale is relatively much 
increased by converting it to an index curve. 

Sec. 98. Relative or Proportional Change. 

If the population record of a given city is as follows, 

1890 100,000 

1^^ 150,000 

1^^ 200.000 
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it is evident that the absolute increase in the two periods 
involved was identical, or 50,000 in each case. The 
proportional increase, however, differed for, in the 
first case, the population had increased by 50 per cent, 
while, in the latter period, the increase was but 33 
per cent. Still, the increase of the first period occurred 
at the rate of 5,000 per year, while, in the second period, 
the rate of increase was larger, being 6,250 annually. 
The latter, however, is based on a larger population. 
If we take the base at the beginning of the period we 
find the proportional rate of increase to be, in the first 
period, 5,000/100,000 or 5 per cent, and, in the second 
period, 6,250/150,000 or 4.17 per cent. It is the last 
named quantity, the proportional rate of change, in 
which we are most commonly interested. 

To say that the population of New York City in- 
creased a million in the last decade and only a hundred 
thousand in a decade sixty years ago does not give us any 
idea of the relative change going on for the two periods. 
In order to remedy this defect. Professor Alfred Marshall 
has devised a graphic method for readily comparing, 
on a historigram, the proportional rates of change for 
different periods. This is illustrated in Fig. 19. 

X y is a historigram showing the population of a city 
at periods ranging from 1840 to the present. It is 
desired to know whether the proportional rate of 
growth was greater between 1845 and 1860 or between 
1900 and 1910. Let M'R', D'N\ MR, and DN be 
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the respective ordinates for the given years, inter- 
secting the historigram XF at C, A\ C and A re- 
spectively. Draw C'B' and CB parallel to the base. 

HlSTOBIQBAM IlLUSTRATINQ MaRSHALL's MeTHOD OF De. 

picnNQ Pboporhonal Rate of Incbeasb. 

16 




Draw A'C and AC and produce each until they cut 
the base at F' and F respectively. 

Now, A'B' and AB represent the absolute increases 
for the given periods and A ' B '/ B ' C and A B/ B C are 



i 
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the rates of increase for the same. The proportional 

A'B' AB 

rates of increase are represented by tttdT a^^d -j^ 

respectively. But AB/BC = CR/FR (corresponding 
sides of similar triangles). Likewise 

A'B' ^ CR' 
B'C F'R'. 
AB CR 
. BC ^ FR _ 1 
" CR CR FR 

= the proportional rate of increase 1900-1910. 
And 

A'5' CR' 



B'C F'R' 



CR' CR' F'R' 

= the proportional rate of increase 1845-1860. 

Therefore, the proportional rate of increase varies 
directly as 1/Ffi or inversely as FR. In the ifigure, 
FR equals approximately bF'R', hence the propor- 
tional rate of increase for the period 190Q-1910 was only 
about one fifth of that for 1845-1860. 

The merit of the above method lies in its extreme 
simplicity of application, the only work necessary being 
to draw the lines AF and CR and measure the lines FR 
for each period. 

Sec. 99. Logarithmic Historigrams. 

The logarithmic historigram has been devised for 
the purpose of showing directly on a graph the pro- 
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portional change for all parts of the period considered. 
It depends on the fact that an equal increase in the 
logarithm of a number indicates multiplication by an 
equal number and hence an equal proportional change. 
Fig. 20 shows such a historigram. The proportional 
increase in population from 1840 to 1845 is indicated 
by the line AC, that between 1865 and 1875 by the 
line EF, and that between 1900 and 1910 by the Une 
HI. Since AC approximately equals EF, the 
proportional change for the first and second periods 
is about the same, but this is not true of the proportional 
rate of change. The latter period was twice as long 
as the former, hence, the proportional rate in the latter 
case was only about half as great. In the last period 
1900-10, the proportional change and also the pro- 
portional rate of change is much less. 

The proportional rate of change is indicated directly 
by the steepness of the curve at the given point. It 
may be calculated, approximately, for any period by 
dividing the altitude of the triangle by the base as 
AC/BC, EF/DF, etc. The results thus obtained are 
not exactly correct, and the explanation of this fact 
brings out one of the weaknesses of the logarithmic 
curve. When a logarithm is doubled, it does not 
follow that the base of the logarithm is doubled. Thus, 
if AC = 2 HI, it is not true that the proportional 
increase between 1840 and 1845 is exactly twice that 
between 1900 and 1910. A quantity when tripled 
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increases its logarithm by the log. of 3 or 0.477 but, 
when a quantity is multiplied by six, it does not increase 
its logarithm by 2 X 0.477, or 0.954, but by 0.778 
instead. DoubUng the proportional change then less 
than doubles the vertical movement of the logarithmic 
historigram. Bowley^ seeks to remedy this difficulty 
by aflSxing to the diagram a series of vertical lines 
representing the logarithms of 2, 3, 4, etc. The length 
of these lines may thus be compared with the vertical 
change in the logarithmic curve for a given period and 
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one may in this manner determine whether population 
has doubled, trebled, or quadrupled. Such a scale is 
attached to Fig. 20. 

The logarithmic historigram, while valuable for 
relative comparison in point of time, is not good for 
comparison of the sizes of different variables at the 
same time. It is of so little value for this purpose that 
the use of a base line is frequently dispensed with and 
the several curves shifted vertically until they are in 
the best position for the easy comparison of the pro- 
portional changes in each. Thus, if we wish to com- 
pare the trend of prices of lumber and steel for a decade 
we care nothing concerning the relative prices per imit 
of the two articles but we do wish to know the relative 
changes in the price of each. Logarithmic curves will 
show this nicely if placed close to each other or given 
a common starting point by means of vertical shifting 
of the whole curves. One disadvantage of logarithmic 
graphs is that the ordinary reader is unfamiliar with 
them and unable to correctly interpret their meaning^ 
since it takes practice to get a ifirm grasp of the idea 
of relativity contained therein. As a result, it seems 
best to confine their use for the present primarily to 
scientific works rather than to utilize them in more 
popular literature. 

Sec. 100. Index Numbers — General Characteristics. 

Tables of historical statistics are reduced to index^ 
numbers for two reasons — ^first, to f aciUtate comparison 
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of the relative changes in two or more synchronous 
variables; second, to permit of the computation of an 
average index series. 

The sizes of the fluctuations relative to their re- 
spective norms in a number of simple historigrams 
cannot be readily compared, especially when the graphs 
have separate origins or when the quantities are of 
very different average size. The following hypothetical 
table and the historigrams shown in the first part of 
Fig. 21 illustrates this difficulty. When the absolute 
prices are plotted as historigrams, it seems that steel 
has fluctuated considerably, while wheat has remained 
almost constant in price. This is, in reality, far from 
true, the deceptive appearance being due wholly to 
the comparative units of each variable chosen. Had 
we taken thirty bushels of wheat instead of one bushel 
as the bgisis of price, we should have found the fluctu- 
ations more closely allied. 

With the given historigrams, a change of 1 mm. on 
the vertical scale means a large relative variation in 
the price of wheat, but a very small one in the price 
of steel. To overcome these difficulties and reduce the 
two sets of prices to a comparable form, it is best to 
convert each to an index series. This may be done 
either by dividing each item of the price series by the 
price for some years arbitrarily chosen as a base or by 
dividing each by an average of the whole group. The 
former is the more common method, but the use of 
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TABLE XV. 
Derivation of Index Numbers. 





Price of 


Price of 


Index 


Index 


Year. 


bt«el 


Wheat 


Price of 


Price of 




per Ton. 


per Bushel. 


Bteel. 


Wheat. 


1890 


$30 


$1.05 


120 


117 


1891 


27 


.96 


108 


107 


1892 


24 


.94 


96 


104 


1893 


22 


.83 


88 


92 


1894 


24 


.88 


96 


98 


1895 


26 


.92 


104 


102 


1896 


22 


.72 


88 


80 




Av. $25 


Av. $0.90 


Av. 100 


Av.lOO 



HlSTORIQRAMB SHOWING PrICE-CHANGES FOB WhEAT AND 

Steeij. 




DATE 



Pig. 21. 



DATE 
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the average of a considerable group as a base is more 
satisfactory, since it is representative and less affected 
by chance variations. The average of the whole group 
is the best base of all and the one most generally ap- 
plicable in statistics. This is the method used in ob- 
taining the indices in the table given above. 

When the price indices are plotted instead of the 
prices themselves, it becomes evident that the changes 
in the ifigures for wheat and steel are, in fact, very 
similar. Now, a vertical fluctuation of 1 mm. in either 
graph has been made to represent exactly the same 
proportional change in price and the two curves may 
be legitimately compared. 

We see, then, that the reduction of a group of his- 
torical data to an index series greatly facilitates the 
comparison of different synchronous variables with 
each other, but the index series is no improvement on 
the original if it is desired to compare different periods 
in the same series as to the relative changes therein. 
Index numbers, then, aid in comparisons of the fluc- 
tuations of different variables at certain specific dates, 
but the function of bringing out well the relative 
changes over periods of time is reserved for logarithmic 
historigrams. 

Sec. 101. Average Indices. 

For many purposes, it is extremely desirable to find 
the general trend of a large number of variables con- 
sidered jointly and, to accomplish this, it is necessary 



182 ELEMENTS OF STATISTICAL METHOD, 



to obtain an average index for each recorded date. 
Good examples of such average indices are those of 
prices and wages which are prepared annually by the 
United States Bureau of Labor. An average price index 
must be computed from the prices of a large number 
of articles of varying importance. The ifirst step in 
the process is to obtain an index series for each article 
for the entire period, using a common time-base for 
all the articles. The base used by the United States 
Bureau of Labor is the decade 1890-9 inclusive. The 
following table will represent, in miniature, one method 
of obtaining the average indices for any given dates 
from the indices of the separate commodities. Several 
hundred articles might be used in practice instead of 
the seven cited below. 

TABLE XVI. 

The Median of Indices 



Date. 


Price Indices. 


Median 
Index. 


Wheat. 


Cotton. 


Steel. 


Lum- 
ber. 


Com. 


Wool. 


Leather 


1880 
1881 

1882 
1883 
1884 


101 

97 

95 

102 

105 


120 
90 

82 
108 
100 


104 

102 

94 

99 

101 


108 

103 

100 

90 

99 


103 

97 

96 

100 

104 


92 

99 

110 

100 

99 


104 

102 

96 

98 

100 


104 

99 

96 

100 

100 



The question at once arises as to which kind of average 
will produce the best results. This depends on the 
specific nature of the problem. If one wishes to study 
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the effect of a changing volume of gold or of money 
on prices, a different average is desirable than the one 
used if the relative cost of living is the subject of in- 
vestigation. In the first case, a change in the price 
of one article is just as good a criterion as a change in 
price of any other. The quantity or importance of the 
conmiodity does not enter into the question at all. 
If no other factors than the quantity of money were 
affecting prices every article should fluctuate exactly 
alike; therefore, the presence of extreme variations 
from the normal, as of cotton in 1880 or the wool in 
1882, are prima facie evidence of extraneous influences 
which should be excluded in computing the general 
trend of prices. The average which best eliminates 
these extreme variations is the median, since it takes 
no account of them. If we array all the indices for 
1880 in order of size, we find the median item to be 
104, which is the index for that year. For 1881, the 
median is 99, and so on to the end of the series. Having 
obtained an average index for each date, the general 
trend of prices may be clearly shown. 

If, however, we desire to know whether the cost of 
living has changed during a given period, an entirely 
different average is necessary. The average consumer 
is not recompensed for the fact that the price of meat 
has gone up by knowing that pepper has fallen an equal 
extent. Therefore, in computing a consumers' index, 
it is necessary to use an average which ranks each article 
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according to the amount consumed, and, for this purpose 
the weighted arithmetic average is best. The weights 
would not be identical under all conditions of con- 
sumption, but would be regulated by the budgets of 
the group of consumers under consideration. The 
index of each article is multiplied by a weight corre- 
sponding to the percentage of the income spent for this 
specific commodity. The sum of the products is 
divided by the sum of the weights to obtain the average 
consumers' index. This is illustrated in the following 

table. 

TABLE XVII. 

Weighted Average of Indices. 



wu. 

Arti- 
cle. 

' Dmt«. 


40 


16 14 


6 


24 


a Q> 

3T3 

3 

106 
100 

94 

1 


Food. 


Bent I Clothing. 


Fuel and 
Light. 


MiscellaneouB. 


In- 
dex 


Prod- 
uct. 


In- ' Prod- In- 
dex net. dex. 


Prod- ' In- Prod- 
uct. 1 dex. ' net. 

1 


In- 
dex. 


Prod- 
net. 


1891 

1892 

,, 1893 


108 
99 
93 


4,320 
3,960 
3,720 


102 1,632 110 
100 1,600 101 
98 1 1,586 89 


1,540 96 576 
1,414 101 606 
1,246 103.618 


104 

101 

95 


2,496 
2,424 
2,280 



All decimals omitted in above table. 



In 1891, the sum of the products is 10,564 and the sum 
of the weights is 100, hence the weighted average is 
approximately 106, this being the consumers' index 
for that year. 

Frequently the original index itself must be an 
average of minor indices. Thus the food index in the 
table above is found by obtaining a weighted average 



HISTORICAL STATISTICS. 185 

of the indices of each specific food item. This process 
is used instead of a direct weighted average of the 
minor items because of the fact that it is seldom possible 
to get complete figures for every commodity for every 
year. By using the above method, such gaps are 
covered without disarranging the whole system. 
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PART IV. 

COMPARISON OF VARIABLES. 



CHAPTER XVI. 

VARIOUS METHODS OF CX)MPARISON. 

Sec. 102. Purpose and Value of Comparison. 

As was mentioned in the early part of the book, 
comparison is, in general, the final goal toward which 
all statistical studies tend. Comparison is necessary 
to give us clear ideas of the relationship of things in 
time and space. It is also essential in detemuning 
whether phenomena are connected or independent 
and in establishing relations of cause and effect. 

We may wish to study: 

1. Changes of a single variable. 

2. The structure of different groups. 

3. Changes in two or more variables. 

We have already discussed in the last chapter by 
means of historical tables, and historigrams, either 
simple or logarithmic, the question of changes of a 
single variable. Most of the methods of comparing the 
structure of two different groups of data have also 
been dwelt upon at some length, but perhaps a brief 
simunary of this second case may be helpful. 

186 
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Sec. 103. Comparison of Frequency Distribution of 
Two or More Groups of Data. 

1. By simple frequency tables and histograms. 

This method is desirable when the aim is to bring 
out comparisons which show the absolute as well as 
the relative size of the various classes in the groups 
compared. It would show, for example^ the relative 
number of men of each grade employed in several 
different establishents as well as the wage distribution 
in each place. 

2. By percentage frequency tables and histograms. 
These tables and graphs show nothing concerning 

the actual size of each group or the classes therein but 
are vastly superior in making clear the relative distri- 
bution between the higher and lower groups in each 
place. These would not show the actual number of 
men employed in various establishments, but would 
bring out distinctly the relative wages paid in each. 

3. Absolute cumulative tables and ogives. 

These are used primarily for the ascertainment of 
the median, quartiles, deciles, etc., but may be utilized 
as a substitute for the simple frequency tables and 
histograms. 

4. Percentage cumulative tables and ogives. 
These are far better than their absolute counterparts 

for purposes of comparison but are not so good for com- 
puting the median, etc. 
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5. Lorenz tables and curves. 

For the purpose of showing distribution of wealth, 
income, etc., at different periods or in different places, 
these are unequalled. 

6. By coeflkients of dispersion. 

These furnish a numerical measurement of deviation 
from the type, a feature lacking in all the previous 
methods. 

7. By coeflkients of skewness* 

By their aid, a numerical measurement of the lack 
of symmetry or the concentration of the items nearer 
to one than to the other extremity of the group may 
be shown. 

8. By coefficients of correlation. 

The discussion of these is reserved for a later chapter. 

Sec. 104. Methods of Comparing Changes in Two or 

More Different Variables. 

Most of the methods of comparison have been dis- 
cussed briefly and will be merely summarized here. 
The principal ones are: 

1. By absolute historical tables or historigrams. 

This is the best method of showing the actual changes 
in different variables. If the wheat crops of the leading 
nations are thus plotted, both the change in production 
for each nation during the period and the relative 
product of each nation at any given time is revealed 
at a glance. 
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2. By index numbers and historigrams. 

These are used when the only desideratum is to 
compare changes and not the absolute size of the quan- 
tities in the two series. All the curves being reduced 
to like bases, it is easy to compare the proportional 
changes relative to the base, in the different variables 
during the same period. Thus, we can see at a glance 
whether the proportional increase m the population 
of New York from 1900 to 1910 was greater or less than 
that of Wyoming for the same period. The fact that 
the absolute increase in New York was vastly larger 
than that in Wyoming in no way obscures the record 
of the comparative proportional change. It must be 
reiterated that the index curves, however, do not indi- 
cate the proportional change in either state for the last 
decade as compared with the change of some past 
decade. 

3. By logarithmic index historigrams. 

Simple logarithmic historigrams show by their vertical 
movements the comparative proportional changes, 
relative to the preceding period, in two or more vari- 
ables during the same time-interval. Their respective 
inclinations from the horizontal at the time of crossing 
any given time-ordinat^ indicate the proportional rates 
of change in the different variables at this date. As 
has been before stated, comparison of logarithmic his- 
torigrams is facilitated by vertical shifting of the curves 
imtil they are in proximity to each other, the eye being 
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thus enabled to better follow and compare their trends. 
This effect is accompUshed mathematically instead of 
mechanically if the original data, in each instance, are 
reduced to index series and the logarithms of the indices 
instead of those of the original numbers plotted. Since 
logarithmic curves are of no practical value for showing 
the absolute size of the different variables at any given 
date, nothing whatever is lost by the preliminary re- 
duction to index-series or the consequent vertical shift- 
ing of the graphs. 

4. By coefficients of correlation. 

These will be explained later. 

Sec. 105. The Plotting of Comparative Graphs. 

If two or more graphs are to be compared, it is 
desirable that they be plotted upon the same sheet 
using the same axes and scales. The use of different 
colors of ink is one of the best methods of avoiding 
confusion. When, however, the graphs are to be 
printed, it is better to adopt different devices for each 
graph, such as making one line heavier than the other 
or utilizing dots, dashes, or combinations of the t^o. 
It is unwise to place a large number of graphs on one 
sheet, if the lines lie close together, for they become 
extremely confusing to the eye. When more than five 
or six are to be compared, it is best to select one ot the 
group as a basis of comparison and place it on each sheet, 
using for this Une a heavy ruling in order to differentiate 
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it from the rest. This rule of course appUes equally 
well to either frequency graphs or historigrams. 

Sec. 106. Long- and Short-time Fluctuations. 

It must be understood that the terms "long" or 
"short " time are purely relative and that the long 
term for one variable might be an extremely short 
period for another. Yet, in most historical series, 
there appear fluctuations of two or more types occurring 
contemporaneously. If we were to study the marriage- 
rate for the past century, we should probably find a 
more or less steady decline throughout the whole period, 
but with oscillations covering five to ten year periods 
marking the epochs of prosperity or financial depression 
and still a third series of annual waves whose crests 
would be found in the month of June. Each of these 
three variations is due to a different cause but all tnree 
causes are acting simultaneously. Likewise, a study 
of the weather changes will reveal a cycle of five or 
six days duration, due to the regular procession of the 
cyclones across the United States, an annual cycle, 
due to the passage of the earth around the sun, and a 
cycle of some fifteen years, due to causes as yet unknown. 
In order to study any one of ttese cycles by itself, it 
is necessary to adopt the procedure of the physicist 
and eliminate, in so far as possible, all the other factors. 
Unfortunately, the statistician can rarely indeed, like 
the physicist, control the conditions of his experiment. 
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but he can do the next best thing by ridding, so far as 
possible, the recorded data of the apparent effects of 
the extraneous causes. If we desire to study the long 
time changes in unemployment, it will be much better 
if the seasonal fluctuations can be removed from the 
field. This is best done by applying the moving aver- 
age according to the rules given in Sec. 97. The wave- 
length, here, is evidently one year, hence the groups 
for the moving average must cover a period of that 
length. 

Sec. 107. The Elimination of Long-time Variations. 

We are frequently interested in the short-time os- 
cillations only and desire, therefore, to eliminate all 
long-time changes. If, for example, we are studying 
seasonal fluctuations in imemployment, we find the 
oscillations due to crises or disturbances of industry 
serious hindrances in our investigation. A simple 
way to study strictly seasonal changes is to obtain a 
seasonal average for a series of years. If monthly 
records only are available, the process would be as 
shown in the table on the next page. By this average, 
the typical trend of imemployment throughout the 
season is made apparent. 

Another very important method of eliminating long- 
time fluctuations in a single given historigram is as 
follows. The data are plotted as a graph, the proper 
period selected, and the moving average line computed. 
The deviations of the original data from the trend are 
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now found and tabulated. These deviations are 
finally plotted on a horizontal base line. 

TABLE XVII. 
Percentage of Unemployment. 



Month. 


Year. 


1900 


1901 


1902 


1903 


1904 


1905 


Average. 


Jan. 
Feb. 
Mar. 
Apr. 


4.1 
3.6 
3.2 
2.1 


5.0 
4.8 
4.1 
2.7 


4.7 
5.2 

5.4 
3.8 


12.6 

12.1 

9.2 

7.4 


7.4 
8.1 
5.2 
6.1 


5.1 
4.6 
4.3 
4.2 


6.5 
6.4 
5.2 
4.2 



Sec. 108. Comparison of the Long- or Short-time 
Fluctuations in Two Historigrams. 

These are two of the commonest and most important 
of all processes in the field of comparative statistics. 
The short-time deviations are compared by reducing 
each of the variables to index numbers, applying the 
rule just given, and plotting both of the final series of 
deviations on the same base line. The process is 
illustrated in the following hypothetical table and the 
graphs obtained therefrom are shown in Figs. 22 and 23. 

Fig. 22 shows the elimination of the short-time 
changes by means of the moving average. Graph d 
shows the steady increase of the supply of the given 
commodity up to 1904 when the production begins to 
fall off slightly. Graph c pictures the trend in price. 
Until 1894, this falls as the supply increases just as 
would normally be expected but, after that date, the 
14 
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TABLE XVIII. 

Comparison of Short-time Fluctuations Only in the Supply 

AND Price of a Commodity. 



•* 




Supply 


• 


Price. 


^ 


-^ • 


Si^A 


v« 




► A 


1 


O K 

80 


< 






< 


o o S 

lir 


1880 






146 






1 


82 






140 






2 


86 


84 


+2 
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price rises steadily despite the increasing supply. This 
may be due to a falling off of the supply in other 
countries or to an increase in the demand for the com- 

Index Historigbams with Moving Averages Indicating the 
Relationship of Supply and Price. 
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modity, due perhaps to an increase in population or 
prosperity or perhaps to a change in the habits of the 
people. 

In Fig. 23, the process is reversed and the long-time 
oscillations are eliminated. Just imagine that each of 
the moving average lines in Fig. 22 is stretched out 
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straight and then superimposed on the axis MN in 
Fig. 23, without detachmg from the moving average 
lines the original historigrams from which the trends 
were derived. This, in effect, is what has been done 
in Fig. 23. When placed in this position, we perceive 
a marked relationship between the two curves, now 



Short-ume Oscillations — Supply and Price. 
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totally unobscured by the long-time changes. When 

a rises, b falls and vice versa. Since the comparison 

of short-time oscillations is needed more frequently 

than that of long-time changes, this method is of great 

practical importance. 
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CHAPTER XVII. 
CORRELATION. 

Sec. 109. Definition of Correlation. 

In the last chapter we have studied various methods 
of comparing data and, when we study comparison, 
we closely approach the subject of correlation. Cor- 
relation means that between two series or groups of 
data there exists some causal connection. Investi- 
gation might prove that the cocoanut crop of the Fiji 
islands had been increasing steadily while the money 
supply of the United States had done likewise. This 
would not imply correlation unless it could be shown 
that one was the cause of the other or that both changes 
were due to some common third factor. It, therefore, 
is often approximately true that a oc 6 when there is 
no correlation but, if correlation does exist, then, neces- 
sarily, the changes in a must bear some fixed -relation- 
ship to the changes in 6. 

Sec. 110. Kinds of Correlation. 

If extraneous factors could be absolutely eliminated, 
it would then follow that a would^ in every case, bear 
an exact mathematical relationship to fc, which might 
be for example aoc 6, aoc 1/fc, aoc V5, aoc(b + x), etc. 
Since, in practice, the influences acting are so numerous 
and complex it is impossible to rid the statistics of all 
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that are undesirable, usually but one or two of the most 
important being thus removed from the field. As the 
minor inharmonious factors still remain to confuse the 
results, it is seldom, especially in the field of social 
statistics, that any absolutely fixed mathematical re- 
lationship between two variables can be established. 
We very often must be satisfied if we learn that when one 
variable increases there is a certain tendency for the 
other to increase or vice versa. If it is proven true 
that in a large number of instances two variables tend 
alwEjTS to fluctuate in the same or in opposite directions 
we consider that the fact is established that a relation- 
ship exists. This relationship is called correlation. 
If the two graphs constantly deviate in the same 
direction at the same time we say that there is direct 
correlation, while if they steadily deviate in opposite 
directions we call it inverse correlation. In Fig. 21 
we see that the price of wheat and the price of steel 
tend to oscillate regularly in similar directions, there- 
fore, the correlation is direct, but in Fig. 23, on the 
contrary, the price curve always rises when the supply 
curve falls, hence, as far as the short-time relationship 
is concerned, the correlation is inverse. Since, in both 
of these instances, the changes of the two variables are 
closely related throughout the entire period, we would 
say that there was a high degree of correlation. In 
the first instance, the price of steel and the price of 
wheat are presimiably both dependent on some other 
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factor, such as a variation in the volume of money or 
general business conditions, while, in the second case, 
the change in the supply is the direct cause of the vari- 
ation in price. 

Sec. 111. Correlation Applied. 

We may study the correlation between two historical 
variable^ or between any other two groups of related 
phenomena. Examples of the first would be the cor- 
relation of the gold movement with exports, the marriage 
rate with the price of wheat, the amount of imemploy- 
ment with the bank clearings, etc. The second might 
be illustrated by the relation between the length and 
breadth of leaves. As a leaf grows longer does it always 
grow broader as well or does the breadth tend to remain 
constant? Do tall fathers always have sons taller 
than the normal? If so, there is correlation. Cor- 
relation is often studied by means of index historigrams 
from which the imdesirable oscillations have been, as 
far as possible, eliminated. Correlation in two variables 
may be roughly illustrated in frequency graphs. 
Graphic methods, however, are all somewhat deficient 
since they cannot give a nimierical jneasurement of 
the degree of correlation existing. For this purpose, 
we must compute a correlation coefficient, in other 
words a numerical measurement of the degree to which 
correlation exists between the subject and the relative 
as the two variables to be compared are called. By 
the term " subject " we mean the variable which is 
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to be used as a standard or measure and, by the term 
" relative " we designate the variable which is to be 
compared with or measured in terms of the subject. 

A coefficient of + 1 indicates perfect direct corre- 
lation, one of indicates no correlation whatever, and 
a coefficient of — 1 means that the correlation is perfect 
but inverse. 

Sec. 112. Kari Pearson's Coefficient of Correlation. 

When two dififerent characteristics of a given series 
of items, as, for example, the length and breadth of 
leaves, or a given characteristic in related pairs of items, 
such as the stature of fathers and sons, the ages of 
husband and wife, etc., are to be compared, or when we 
wish to study the relationship between the long-time 
changes in two historical variables, the most satis- 
factory coefficient of correlation is that devised by the 
great biologist Karl Pearson. 

Let xij a:2, Xj, etc., be the deviations of the items of 
the subject from the arithmetic average, and i/i, y2, 
yz, etc., be the corresponding deviations of the items 
in the relative. Let en be the standard deviation of the 
subject and cr, be the standard deviation of the relative. 
Let n equal the total number of pairs of items. Let r 
represent Karl Pearson's coefficient^ of correlation. 

*For the derivation of the formula for this coeflScient see 
G. Udny Yule's "Introduction to the Theory of Statistics," 
pp. 168-174. 
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Then 
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TABLE XIX. 
Computation op Kahl Pearson's Coefficient of Corre- 
lation FOR Ages of Husband and Wife. 
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22/2=348 


S(a;2/)= +287 
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The mode of computing this coefficient is shown in 
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the following table in which the respective ages of 
husband and wife are placed on the same line. This 
must be considered as a representative group of sample 
data standing for a very large number of items. 

^ ^(xy) ^ 287. ^ 287 ^ . o^ 

naiffi 20X4.02X4.17 335.27 '^''^^' 

A study of this table will make plain that the numer- 
ator principally regulates the size of the coefficient. 
If, in any pair of items, both subject and relative are 
larger or both smaller than their respective averages, 
then xy will be positive but, if the subject is larger 
than the average while the relative is smaller, then xy 
is negative. But many negative values for oi^ will 
make the coefficient small while if the values are nearly 
all positive it is likely to be quite large. If xy is quite 
uniformly negative, then the coefficient will represent 
a high degree of inverse correlation. It is evident, 
then, that the comparative location of the items of 
the respective pairs with regard to the average is the 
important criterion in this coefficient. The distance 
from the average is, relatively, of less moment. 

Sec. 113. The Application of Kail Pearson's Coeffi- 
cient to Long-time Changes in Historical Variables. 

This coefficient may be applied as well to historical 
data as to the variations in items at any specific time. 
In its original form, however, this coefficient can only 
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be used, with the former, in connection with the long- 
time changes. In Fig. 22, for instance, curves a and h 
both tend to remain on opposite sides of the arithmetic 
average line, hence a negative coefficient would be 
obtained. The size of this coefficient would be affected 
but slightly by the short-time oscillations, since few 
of these cross the average line at all. The correlation 
between the short-time oscillations is seen at a glance 

to be inverse. 

Index Historigrams. 
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Fig. 24 shows us a case in which the long-time 
changes are in the same direction while the short-time 
oscillations are in opposite directions. By the use of 
Pearson's method, we would obtain a large positive 
coefficient, but this would take no account whatever 
of the inverse relationship existing between the short- 
time fluctuations. In computing the coefficient for 
the long-time changes, the method used in Sec. 112 is 
followed throughout, the items and deviations for the 
same date being paired together. 

Sec. 114. The Modification of Karl Pearson's Co- 
efficient for Use with Short-time Oscillations. 

The modification of the historigrams in order to 
bring out distinctly the short-time fluctuations suggests 
that a like process will be useful in obtaining a coeffici- 
ent of correlation and this is the method actually 
followed. In Fig. 23 the long-time changes have been 
completely eliminated. The deviations of the items 
from the trend instead of from the arithmetic average 
are now taken into consideration in computing the x 
and y columns. These deviations are also squared and 
used in computing ori and a^. Table XVIII, when com- 
pleted for this purpose, appears as shown in Table XX. 
Using the results obtained from the latter table we find 



<ri 



= \^ = >/f =^^^^=3.78. 



0r2 = ' ^ 



^ . ^'-f? . VS3T2 = 7.98, 
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2;(xy)_ -728 ^. 

n<7i<72 25 X 3.78 X 7.98 

It will be observed that, in the foregoing table, 
n = 25, since only the years from 1882 to 1906 inclusive 
can be used in computing the coefficient. The indices, 
moving averages, and deviations are carried only to 
units place. Greater precision might be obtained by 
carrying out the decimals. The coefficient of — .965 
shows a very high degree of inverse correlation between 
supply and price, indicating that an increase in supply 
means almost invariably a fall in price and vice versa. 

Sec. 115. The Coeifficient of Concurrent Deviations. 

Karl Pearson's coefficient, when modified as de- 
scribed above, is an excellent method of comparing 
short-time fluctuations, but it requires considerable 
time and effort for its computation. There is another 
method of obtaining a coefficient of correlation which 
has the merit of extreme simplicity and, in most cases, 
may be used satisfactorily in the study of short-time 
oscillations. It is wholly worthless for dealing with 
long-time changes, since it takes almost no accoimt of 
the general trend. 

If, in the comparison of two historigrams, it is 
noticed that both curves tend to move in the same 
direction at the same time — that is, if the deviations are 
concurrent — ^we say that there is evident direct corre- 
lation between the short-time oscillations. If, at each 
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given date, the curves are moving in opposite direc- 
tions — ^the deviations being divergent — ^we know that 
the short-time movements are inversely related. In 
this case, we consider not the deviations from the 
arithmetic averages or from the moving averages but 
simply from the item of the last preceding date recorded. 
The size of the deviation is not taken into account but 
only its direction. This fact sometimes constitutes a 
weakness in this coefficient, for a trivial change from 
the precedmg year is often due to some cause other 
than the principal ones which alone can be taken into 
consideration, but the slightest change has just as great 
a weight as the largest if in the same direction. When, 
however, the pairs of items are numerous, the effects of 
such chance errors are not likely to prove serious. 

The following empirical formula is used to bring the 
coefficient into a form similar to others, that is, to make 
+ 1 indicate perfect direct correlation, — 1 perfect 
inverse correlation and no correlation at all. 

If r = the coefficient of correlation, 
n = the number of pairs of items, 
c = the number of concurrent deviations, 
then 

2c — n 



/ 2c- 



The use of the signs requires a word of explanation. 
If the quantity is negative the sign (-) is 

TV 
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TABLE XXI. 

Correlations op Short-time Fluctuations op Supply and 
Price by Means of Concurrent Deviations. 
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introduced before it and also before the radical. This 

is necessary in order that the square root may be 

extracted and the result retain the same sign as that 

of the original quantity. 

The derivation of this coefficient from the same data 

used in Table XX is illustrated in Table XXI. In this 

latter table, 

n = 28, c = 0. 
Then 



-w- 


2c - n 

n 


"f 


0-28 
28 



r=- V- (- 1), 

r =- VT, 
r =- 1. 

Therefore the inverse correlation is perfect. 

Let us suppose another case in which our study covers 
48 periods, n therefore equalling 47. If 16 pairs of 
deviations were concurrent the formula would appear 
thus: 



32-47 

tki 

47 



/ 2c-n / 

= - yj-^^ = - ViSlOi = - .56. 

This would indicate then only a moderate degree of 
inverse correlation. 

In addition to its simplicity, this coefficient has the 
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merit of being well suited for use with irregular graphs 
like the one shown in Fig. 18 in which smoothing by 
means of a moving average is well-nigh impossible. 

We have used index numbers as examples in the tables 
cited but this was only done in order to compare the 
coefficients with the historigrams shown in the illustra- 
tions. In the computation of either Karl Pearson's 
coejfficient or that of concurrent deviations, nothing is 
gained by reducing the dstta to indices and the process 
tends to introduce slight mathematical errors. 

Sec. 116. The Use of the Lag. 

We oftentimes find, in the comparison of two histori- 
grams, that, while there is evident correlation, the crests 
and troughs of the waves do not quite coincide in the 
two graphs. This may be due to the necessity of an 
interval of time elapsing between cause and effect. 
Unemployment naturally gives rise to poverty and 
poverty to pauperism but it takes a considerable 
period of time for pauperism to result from the lack of 
work. 

When one feels sure that one given phenomenon is 
the cause of another and it seems probable that a 
period of time should intervene between cause and 
efifect, the graph representing the effect should be 
lagged sufficiently to make the wave crests coincide- 
that is, the dates coupled together should not be iden- 
tical. The required length of lag can be best deter- 
mined by a study of the two historigrams. Fig. 25 
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illustrates a case in which there is marked correlation 
but curve b lags behind curve a about one month. In 
computing a correlation coefficient, therefore, it would 
be necessary to pair together the figures for o in May 
and b in June, a in June with b in July, etc. 

The Lag in Historiorams. 




S I I M ^ 



i )c 



g 3 e i I M ^ 1 



DATE 



Fig. 25. 



One should never assume a certain lag merely from 
the appearance of the graphs for, in that way, one may 
easily transpose cause and effect. By lagging curve b 
we obtain a high degree of direct correlation but, if 
we lagged a, we would find just as marked inverse cor- 
relation. We must, then, determine from information 
outside the graphs what the general time relationship 
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should be and utilize the graphs merely to obtain the 
proper length of that time interval. The lag should 
be used, whenever necessary, in the construction of 
comparative historigrams and also in the computation 
of coefficients of correlation. An example of the use 
of the lag is given in Fig. 25 and in Table XXII, ac- 
companying Sec. 121. 

Sec. 117. The Probable Error. 

If we find that two variables fluctuate together in 
two or three different instances, it by no means follows 
that this is a proof of the existence of correlation any 
more than would the fact of throwing double sixes 
with a pair of dice three times in succession prove that 
there was any connection between the dice. Such co- 
incidences are likely to be entirely due to chance. 
If, however, double sixes were thrown ten times in 
succession we would suspect that some other force 
than chance was at work; so, when we find an apparent 
relationship between the fluctuations of two graphs 
at many different dates, we beUeve that the chances 
of an actual relationship existing are very large. 

Were we to find 55 pairs of deviations out of 100 
concurrent and 45 divergent we would presume that 
the inequahty was due entirely to chance but if 70 
pairs were concurrent and only 30 pairs divergent, the 
probability of this being due to chance alone would 
be extremely slight. 

Thus, we see that the probable error of a coefficient 
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of correlation varies inversely both with the niunber of 

pairs of items and with the size of the coefficient. The 

law of probable error has been carefully worked out 

by mathematicians, and the following formula evolved 

by processes too complex to be given in a text of such 

an elementary nature as this one.^ 

If r = the coefficient of correlation 

and n = the number of pairs of items. 

Then 

.67(1 - r^) 



the probable error = 



Vn 



This formula presupposes a purely chance grouping 
of the data. It deals only with the error arising from 
the small size of the coefficient or the Umited number 
of items and it is in no way affected by irregularities 
in the size of the items or the size of classes. It means 
that the coefficient of correlation should always be 
written 

.67(1 - r^) 

r d= — 

-vn 

and indicates that the chances are that r actually lies 
between 

, ^ .67(1 - r^) _ .67(1 -r*) 
^ H 7= — - and r ^^-7= — -^ 

Sn Vn 

1 For a discussion of this point see Bowley, A. L., Elements 
of Stali8tic8, Part II. 
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Sec. 118. The Interpretation of the Coefficient of 
Correlation. 

The following rules will assist in giving a general 
idea of the interpretation of r according to its relation 
to its probable error: 

1. If r is less than the probable error, there is no 
evidence whatever of correlation. 

2. If r is more than six times the size of the probable 
error, the existence of correlation is a practical certainty. 

There might be added to the above the further 
statements that, in those cases in which the probable 
error is relatively small. 

1. If r is less than .30 the correlation cannot be con- 
sidered at all marked. 

2. If r is above .50 there is decided correlation. 
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CHAPTER XVIII. 

THE RATIO OF VARIATION. 

Sec. 119. The Ratio of Variation Defined. 

It was stated in Sec. 110 that, if all extraneous in- 
fluences could be completely eliminated, a given cause 
would always bear a definite mathematical relationship 
to its effect. It was also shown that such complete 
elimination is practically impossible. Nevertheless, it 
is frequently possible to establish the mathematical 
relationship actually existing with an approximate 
degree of accuracy. The possible relationships are 
niunerous, but we shall deal with but one of the most 
common. 

In Chap. XI, the fact was noted that items of a given 
variety tend to cluster about some specific type or 
mode. Thus, we have a modal length for leaves, a 
modal height for men and a normal price for wheat. 
The mode is usually located quite close to the arith- 
metic average, the latter quantity being, as a rule, more 
definite than the former and hence, in many instances, 
better suited for use as a base, though, in cases in which 
the mode is well defined, it is probably preferable. 

Because two variables constantly fluctuate together, 

either directly or inversely, it by no means follows that 

the deviations from the type or mean will be absolutely 

or proportionally the same. In other words, the wave- 

216 
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length may be identical but the altitudes markedly 
different. This fact may be compared to the vibration 
of two pendulums both of which have the same length 
and the same period of oscillation, though one swings 
through an arc of 5° and the other through one of 20°. 
Fig. 23 gives us an illustration of this case, curve a, 
representing price, swinging approximately twice as 
far in each direction as does curve 6, which stands for 
supply. Such deviations in price would be typical 
for commodities the demand for which is inelastic. It 
is frequently desirable to know just what is the average 
ratio between the proportional or percentage deviation 
of the two curves from their respective types. The 
grain speculator is anxious to know how much the price 
of com will be raised if the normal crop is 10 per cent, 
short. The reformer may be desirous to learn in what 
proportion liquor sales diminish when the number of 
saloons is halved. The sociologist is interested in 
finding out to what an extent the birth-rate is affected 
by the increase in the age of the parents at marriage. 
This relationship may be entitled the ratio of variation. 
Since it is a corollary of the coefficient of correlation 
and so closely related thereto, it has frequently been 
confused with the same but the two are, evidently, 
radically different in their nature. 

Since the proportional fluctuations of supply, illus- 
trated in Fig. 23, are, on the average, about one half 
as large as the price oscillations, the ratio of variation, 
in that case, would, therefore, be approximately 0.50. 
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Sec. 120. Computmg the Ratio of Variation. 

As was indicated in the last section, we desire to find 
the average ratio of the proportional deviations from the 
type of the items in the relative as compared with those 
of the subject. The first step is to determine which 
of the variables shall be taken as the subject and which 
shall be considered as the relative. In the biological 
field, it usually makes little difference which series is 
chosen for each, but, in studying the social sciences, 
the series having the larger average proportional 
deviations is taken as the subject. This is done in 
order that the ratio may be expressed as less than 
unity. In Fig. 23, therefore, the price curve a would 
be chosen as the subject and the supply curve b as 
the relative. 

One way of computing the ratio of variation for histor- 
ical series would be to take the deviation of the relative 
from the mean at each given date and divide it by the 
corresponding deviation of the subject, summate all 
the quotients thus obtained, and then find the average 
ratio by dividing by the nimiber of quotients. Thus, 
if at any given ordinate, the deviation of the subject is 16 
per cent, and that of the relative 4 per cent, the quotient 
would be 0.25, but if the deviation of the subject were 
- 10 while that of the relative was + 2 the quotient 
would be - 0.20 and so would tend to reduce the sum. 
If the oscillations were all perfectly regular, this system 
of determining the ratio would be satisfactory, but in 
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practice it cannot be well used, so that other and better 
methods have been devised, the best of these being the 
Galton graph devised by Professor Francis Galton. 

Sec. 121. The Galton Graph. 

In applying this graph to two historical variables, it is 
first necessary to reduce each to an index series. When 
it is desired to study long-time changes or when no 
well-defined trend is discernible, the index numbers 
are obtained by dividing each item by the arithmetic 
average. The pairs of indices are then plotted, using 
the subject as ordinate and the relative as abscissa 
in each case. The method of procedure may best be 
illustrated by an example. 

The following table shows the actual bank clearings 
and inunigration statistics for the United States from 
1880 to 1896^ and the same reduced to index numbers 
for use in the modified form of Galton graph shown in 
Fig. 26. The bank clearings for the year are a measure 
of prosperity. Business conditions, however, affect 
principally the immigration for the succeeding year, 
hence it is necessary to introduce a lag (see Sec. 116) 
of one year and compare the bank clearings for 1880 
with the immigration for 1881. Since the fluctuations 
in the immigration index are larger than those of the 
index for bank clearings, the former will be used as the 
subject and the latter as the relative. 

*See Statistical Abstract of the United States for 1909, pp. 
711 and 753. 
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TABLE XXII. 






Data por Galton Graph to Show the Ratio of Variation | 


BETWEEN Bank Clearings and Immioration. 


r 
1 

i 


Subject. 


Belatire. 


Dftto. 


Immlcrants Index of 
in Tens of Immi- 
ThousandB. grants. 


j 
Tk.A^ Bank Clearinn 
*'•*•• in Billions. 


Index of Bank 
Clearings. 


1 
1 

1 

1 


1881 


67 


136 


1880 ; 37 


106 


t 
1 


1882 


79 


161 


1881 49 


140 ! 


1883 


60 


122 


1882 47 


134 


1 


1884 


52 


106 


1883 40 


114 


1 


1885 


40 


81 


1884 34 


97 




1886 


33 


67 


1885 25 


71 




1887 


49 100 


1886 33 


94 




1888 


55 112 


1887 35 


100 




1889 


44 


90 


1888 31 


89 


1 


1890 

1 


46 


94 


1889 


35 


100 


1 

1 1 


1891 


56 


114 


1890 


38 


109 


i 


1892 


62 126 


1891 


34 


97 




1893 


50 102 


1892 


36 


103 


1 
1 


1894 


31 63 


1893 


34 


97 


1 

1 


1895 


28 57 


1894 


24 


69 




1896 


34 69 


1895 


28 


80 






Av. 49.1 






Av. 35.0 







Sec. 122. The Ratio of Variation. 

In the Galton graph, it is best to plot the subject 
on the vertical and the relative on the horizontal scale. 
When this is done for the preceding data, we obtain 
a number of points, some rather widely scattered, but 
tending, in general, to form a band running downward 
to the left. The next step is to draw the line most 
nearly approaching the general trend of the dots. This 
is usually done by finding a line running in the correct 
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direction, as nearly as can be located by the eye, and 
having an equal number of points on each side of it. 



Modified Galton Graph ShowinO Ratio of Variation 
BETWEEN Bank Clearings and Immigration, United 

States. 1880-1896. 
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The line RCy in Fig. 26, approximates this result. If 
the correlation were perfect, every point would be located 
either on a straight line or on some well-defined curve* 

While it is probably true that the mathematical re- 
lationship existing between subjects and relatives in 
general are such that, if they could be exactly ascer- 
tained, the graphs resulting would more frequently be 
curves than straight lines, yet it is foimd, in practice, 
that the points plotted are usually so widely scattered 
that a straight Une approaches them all about as closely 
as any mathematical curve that can be found. Hence, 
in most instances, a straight Une and not a curve is 
drawn since the straight line alone can be used for com- 
putation of the ratio of variation according to the simple 
method devised by Galton. In Fig. 26, the points are 
so widely scattered that the influence of the external 
forces in hiding the correlation, as well as the correct 
ratio of variation, is clearly evident. 

The following rules apply to Galton graphs in general : 
If the line of points slopes downward to the left, the 
correlation is direct but if it slopes downward to the 
right, then the correlation is inverse. If the points 
are so badly scattered that they show no definite tend- 
ency to form a straight line or regular curve then no 
correlation is indicated, but the more closely they ap- 
proach the linear form, the larger the coefficient of 
correlation to be expected. 

If the relative increases or decreases 1 per cent, for 



i 
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a change of a like amount in the subject, then the ratio 
of variation is evidently unity. In this case, the points 
should range themselves along a line of 45° slope such 
as MN, which represents the line of equal proportional 
variation. If, however, the relative shows a tendency 
to change less, proportionally, than the subject, the 
angle with the vertical will be less than 45°. In this 
case, the Une along which the points range themselves 
is called the regression line. It receives its name from 
the fact that, in the biological field, it has been noted 
that the forces of heredity do not lead offspring to 
inherit, in full amount, the pecuUarities of the parents. 
If both parents are one inch taller than the normal the 
chances are that their children will be something less 
than one inch above the average stature, in other 
words, they regress toward the type, hence the term 
regression. In general, a very large degree of regression, 
that is, a large deviation of line 72 C from MN indicates 
a slight degree of correlation. When /2C is nearly 
vertical, it shows that the relative is but slightly affected 
by the changes in the subject. If, for example, we were 
studying the sons of tall fathers and it was found that 
their stature was but slightly above the normal, there 
would be a strong probability that the apparent vari- 
ation was due wholly to chance and that they were 
really no taller, on the average, than the sons of men of 
average height. 

It must also be remembered that a regression line 
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based on a few instances is likely to be far from accurate, 
and, when the probable error of the coefficient of cor- 
relation is so large as to vitiate that coefficient, it is 
practically useless to attempt to compute a ratio of 
variation. 

If BB is any horizontal line cutting the line of re- 
gression at A, then, manifestly, the ratio of the average 
variations of the relative to the average variation of 
the subject is represented by AB/BC or the tangent 
of angle ACB. In Fig, 26 AB = 82 and BC = 122. 
Therefore, the ratio of variation equals 82/122 = 0.67. 
This means that, on the average, for every change of 
1 per cent, in the subject, there is a tendency for the 
relative to change in a like direction 67/100 of 1 per 
cent. The complement of this fraction, or 0.33, may 
be called the ratio of regression. 

Sec. 123. Elimination of Long-time Changes. 

Just as in the case of correlation so in computing the 
ratio of variation, long-time trends must be eliminated 
before we can study the relationship between the short- 
time oscillations of two historical variables but the 
process dififers slightly. The first step is to decide 
certainly whether there is a trend of such a character 
that it may be brought out by a moving average. 
If not, the method, described in Sec. 121, of taking 
deviations from the arithmetic average only is prefer- 
able. When a definite trend exists and the wave-length 
of the short-time oscillations has been determined, the 
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next step is to compute the moving-average line for 
each series. The third operation is to divide each item 
by the moving average of that series for that year. 
This gives two series of index nimibers which may be 
plotted in pairs exactly as were the indices in Sec. 122. 
A study of the following hypothetical table may help 
to make the method plain. 

TABLE XXIII. 

Data for Galton Graph to Show the Ratio of Variation 

FOR Short-time Changes Only between Bank 

Reserves and Check Circulation. 



Date. 




Subject. 




Relative. 


B 

Bank 

Reserves 

in Dollars 

per Capita. 


h 

Moving 

Average. 


B 

b 
Index. 


C 

Check 

Circulation 

in Dollars 

per Capita. 




Moving 
Average. 


c 

c 
Index. 


1880 


8.2 






27 




1881 


8.0 






26 






1882 


7.2 


7.7 


94 


22 


26 


85 


1883 


7.4 


7.6 


97 


25 


27 


93 


1884 


7.7 


7.5 


103 


30 


28 


107 


1885 


7.7 


7.5 


103 


32 


29 


110 


1886 


7.5 


7.4 


101 


31 


30 


103 


1887 


7.2 


7.3 


99 


27 


31 


87 


1888 


6.9 


7.3 


95 


30 


32 


94 


1889 


7.2 


7.3 


99 


35 


33 


106 


1890 


7.7 


7.2 


107 


37 


35 


106 


1891 


7.5 


7.1 


106 


36 


37 


97 


1892 


6.7 


7.0 


96 


37 


38 


97 


1893 


6.4 


6.9 


93 


40 


39 


103 


1894 


6.7 


6.7 


100 


40 


40 


100 


1895 


7.2 


6.6 


109 


42 


40 


105 


1896 


6.5 


6.5 


100 


41 


39 


105 


1897 


6.2 






37 






1898 


5.9 






35 







16 
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We observe that the bank reserves per capita show a 
constant tendency to diminish while, at the same time, 
the per capita check circulation is steadily increasing. 
But a change of $1 per capita does not represent the 
same proportion when the normal check circulation 
stands at $26, that it does when the normal circulation 
has increased to $39. Hence it is necessary in obtain- 
ing the correct proportional change to first obtain a 
moving average of each series. This gives a correct 
base at each point. If the original item is now divided 
by the moving average for that date, the resulting index 
shows correctly^ the relative increase or decrease due 
to the short-time influence. The pairs of indices may 
now be plotted in exactly the same manner as were 
those in Fig. 26. 

Sec. 124. The Correlation Table. 

When the items in both subject and relative are very 
numerous, it requires a large amount of effort to reduce 
each to an index nimiber and plot it as was done in Fig. 
26. To obviate this difficulty, which is especially im- 
portant in the case of biological data, where the items 
may be extremely numerous, the correlation table has 
been invented. Its purpose is to group together the 
items of the subject in classes in a frequency table and 
then find some kind of an average of the items of the 
corresponding class of the relative and compare this 
average with the median of the class in the subject. 
It is, then, simply a method of substituting averages 
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for the individual pairs of items. The simplest form 
of the correlation table is given below, the subject being 
the lengths of a set of leaves, chosen at random, and 
the relative being the corresponding breadths of the 
same leaves. The first two colunms under the heading 
" Subject '' form a simple frequency table of the leaf 
lengths. To the right of each class of lengths imder 
the title "Relative," we find a record of the breadth 
of each leaf belonging to that class of lengths. In the 
next to the last column of the table, we find the average 
breadth of all the leaves contained in that length class 
and it is this average which is to be compared with the 
median length of that class in order to obtain the ratio 
of variation. Reduction to index numbers is essential, 
for the same reasons heretofore described, before the 
data are ready for plotting as a Galton graph. 

A few statements concerning the construction of 
the table may be helpful. 

The class-intervals in the subject must all be equal 
and the same rule applies to the class-intervals within 
the relative. Classes covering all the data between 
the extremes of each group should be entered in the 
table even if no items occur in certain of the classes. 
In drafting a preliminary correlation table, it is cus- 
tomary, as the leaves are checked off in order of their 
length, to place a dot in the proper square to represent 
the breadth of each leaf. The dots are afterwards, 
simply counted and their sum entered in the corre- 
sponding square in the permanenttable. 
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The column near the right-hand margin marked, 
"Total Number/' is merely a check column and must 
correspond with the first column of the table. 

The ideal average for use in all studies concerning 
the ratio of variation seems to be the mode, for it is 
the real center from which deviations take place. 
The fact, however, that the mode is, so commonly, ill 
defined, especially when the items are few in number, 
renders it less satisfactory as a basis than the arithmetic 
average, which has the merit of being absolutely defi- 
nite. It is the latter, then, which is, in practice, most 
frequently employed and which forms the base in the 
model correlation table shown above. 

Bowley suggests^ the use of the median as being more 
convenient and just as accurate. It is one of the best 
averages when the items are numerous but not quite 
so satisfactory if they are badly scattered. If the 
class-intervals are large, it is necessary to interpolate, 
within the class, for the median item, according to the 
formula laid down in Sec. 71. If, however, the classes 
are very narrow it may be sufficiently accurate to con- 
sider the item at the midpoint of its class though this 
usually introduces a slight error. 

When the arithmetic average is employed, as in 
Table XXIV, it is computed by assuming that the lengths 
or breadths of all the leaves within any class or sub-class 
fall at the midpoint of the same; thus, in the class 44-50, 

1 A. L. Bowley, Elements of StatisticSj p. 323. 
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the leaves are all considered as having a length of 47 
mm. each. In the third column, we have the leaf 
lengths reduced to a series of index nimibers by the 
process of dividing the midpoint of each class by the 
average length of all the leaves in the entire group. 
An index series is likewise obtained for the breadths 
by dividing the average breadth of leaves in each of the 
final classes in the relative by the average breadth of 
all the leaves. 

A Uttle study of the table will show that, in ad- 
dition to its use in determining the ratio of variation, 
it gives considerable information concerning the cor- 
relation between the subject and the relative. As 
the leaves become longer, if there is correlation, they 
should become broader also. This will cause the modal 
breadth to shift constantly toward the right. The 
fact that the modes proceed regularly across the table 
at an angle approaching 45° indicates correlation. In 
the given table the line of modes slopes downward to 
the right. This indicates direct correlation. If they 
formed a regular slope falling toward the left, the cor- 
relation might be just as marked but would be inverse. 
The more closely the modes follow a straight line across 
the table and the closer the items are packed about 
the modes the higher is the degree of correlation in- 
dicated by the table. The line of totals at the foot of 
the table should, if there is correlation, show a well-de- 
fined mode. The summation of these totals acts as 
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another check on the correctness of the numbers of 
leaves entered in the table. 

If one imagines the correlation table to be a plane 
surface and the numbers of items entered in the squares 
of the relative to represent altitudes at those respective 
points, he can picture to himself a sort of rugged hill 
with its crest near the cefnter of the table. The surface 
of this hill is known as the correlation surface. If the 
correlation were perfect, the base of the hill would be 
oval in shape, the center of the hill would be in the 
center of the table, the surface would be smooth and it 
would slope off regularly in all directions so that a 
vertical section, cut through the crest in any direction, 
would always present the bell-shaped form of the normal 
frequency curve. 

The correlation table being completed properly, the 
final step is to plot, in a Galton graph, the index of each 
class of the subject as an ordinate with the index of 
the corresponding class of the relative as the abscissa. 
The line of regression is drawn and the ratio of variation 
computed exactly as described in Sec. 122. 

Sec. 125. Conclusion. 

We have now finished, in addition to a brief re\'iew 
of the history of statistics, a discussion of the elementary 
methods most necessary in the study or manipulation 
of simple statistical data. We have covered the factors 
in the collection, analysis, and comparison of large 
numbers which are believed to be most essential to 
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the practical statistician, especially in the field of the 
social sciences. For a discussion of details, the de- 
velopment of the mathematical theory, or the practical 
applications of statistics the reader is referred to more 

advanced works. 
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APPENDIX A. 

Calculating Devices. 

The statistician will find that certain mechanical 
aids are essential if he is to do much actual statistical 
work. For operations in which it is not desired to 
attain greater accuracy than three digits, small jslide- 
rules costing from $2.75 to $21.50 are very satisfactory. 
A Fuller spiral slide rule, price $30.00, is very conven- 
ient in multiplication and division and the results may 
be read correctly to five digits. The Thacher cylindrical 
slide rule costing $35.00 is slightly less convenient but 
has the additional merit of giving squares and square 
roots directly. Its accuracy is the same as that of 
the Fuller. If it is necessary to have exact results in 
multiplication or division, an arithmometer or reckon- 
ing machine is very desirable. These cost from $193 
to $338. They may also be used for addition or sub- 
traction, but, for the former purpose, are not so rapid 
as an ordinary adding machine. All of the above may 
be obtained of Keufifel, Esser and Co., of New York and 
Chicago. An adding machine is almost essential if a 
large amount of adding is to be done. The most com- 
plete ones are sold by the Burroughs Adding Machine 
Co., of Detroit, Mich., at from $325 up.^ The Wales 
adding machine manufactured by the Adder Machine 
Co., of Wilkes-Barre, Pa., is a slightly less expensive 
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machine which is satisfactory for most purpcwses. A 
still cheaper machine which adds perfectly but does not 
print is the Comptometer sold by the Felt and Tar- 
rant Mfg. Co., of Chicago, 111. 

Where great nmnbers of items are to be tabulated 
at regular intervals, as in the case of expense accounts 
or pay rolls of municipalities or large companies, public 
utility reports, and the like, the Hollerith tabulating 
and sorting machines are great time savers. The items 
must first be pimched on printed cards and the results 
are then automatically classified in any form or the 
various items of any group smnmated, as desired. 
These machines may be rented from the Hollerith 
Tabulating Machine Co., of Washington, D. C, at 
from $25 up, per month, each. 

Many books of mathematical tables are published 
which are less expensive than machines and, for some 
purposes, are more satisfactory. The following are 
some of the best. 

Barlow's Tables of Squares, Cvbes, Square-^oots, Cvbe-^oots, 

and Reciprocals of aU IrUeger Numbers up to 10,000, E. Spon, 

N. Y. Price. $2.50. 
Bowser, E. A. Five-place Logarithmic Tables. D. C. Heath & 

Co., Chicago. Price 60c, 
Bauschinqeb and Peters. Logarithmic Tables, Asher & Co., 

London. Gives eight-figure logarithms of numbers up to 

200,000. Price 18s. 6d. 
CoTswoRTH, M. B. The Direct Calculatar. Series 0. McCorquo- 

dale and Co., London. Gives products up to 1,000 X 1,000, 

Price, with index, 258. 
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Crellb, a. L. Recfientafeln, G. Reimer, Berlin. Gives 

products up to 1,000X1,000. Price 15m. 
Jones, G. W. Logarithmic Tables. G. W. Jones, Ithaca, N. Y. 

Gives six-place logarithms of numbers up to 10,000, trigo- 
nometric fimctions, squares, cubes, roots, etc. Price $1.00. 
Ludlow, H. H. Logarithmic and Other Mathematical Tables. 

John Wiley & Sons, N. Y. Price $2.00. 
Peters, J. Neue Rechentafeln fur Mtdtiplikation and Division. 

G. Reimer, Berlin. Gives products up to 100X10,000. 

Can be obtained with English introduction. Price 15m. 
Van Velzer, C. A. Four-Place Logarithmic and TrigonomMric 

Tables. Tracy, Gibbs & Co., Madison, Wis. Price 30c. 
Wells, W. Six Place Logarithmic Tables. D. C. Heath & Co., 

Chicago. Price 60c. 
ZiMMBRMANN, H. Rechentafcl. Asher & Co., London. Gives 

products up to 100X1,000; also tables of squares, cubes, 

square roots, cube roots, etc. Price 5s. 
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used 86. 
Percentage 

histograms 62. 
ogives 64. 
Percentages 

use in tables 49, 52. 
Percentage histograms 

comparison by 103. 
Percentage ogives 

comparison by 103. 
Pictograms 56. 
Plotting 

of comparative graphs 105. 
of histo^ms 61-62. 
of histongrams 96-97, 105. 
Possible error 

of arith. av. 47. 
in mathematical opert^ 
tions 44. 
Price indices. 

consumers' 101. 
monetary 101. 
Primary investigation 28-36. 
Primary method of collecting 

statistics 38. 
Probabilities, law of 15-16. 
Probable Error 

of arithmetic average 47. 
of ceofficient of correlation 
117. 
effect 118. 
in mathematical opera- 
tions 44. 
Problem 

definition of 22. 
factors of 23. 
Progressive accuracy 18. 
Proportional change 98, 100, 
104, 119, 122-3, 



Proportional rate of change 
98-99. 

m 

Quartiles 

coefficient 88. 

definition 87. 

deviation 88. 

location 87. 

measure of dispersion 88. 
Questions 

choice of 33. 

nature of 31-33. 

rules for 33. 
Quetelet, Lambert 6. 

Range 

defined 81. 
Rate of change 98, 99. 
Ratio of regression 122. 
Ratio of variation 

computation 120-1. 

defined 119. 
Regression 

definition 122. 

line 122-4. 

ratio of 122 
Relative 

definition 111-2. 

designation of 120. 

in correlation table 124. 

in Galton graph 121-2. 

in ratio of variation 122. 
Relative change 98-100, 123. 

Samples 

representative 35. 
Samplmg 

methods of 16, 35. 
Schedules 

card 32. 

in charge of enumerators 
32, 36. 

filled by informants 31. 

form 32. 

incomplete 38. 
SchmoUer, Gustav 6. 
Seasonal fluctuations 107. 

elimination 108. 
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Secondary investigations 27. 
Secondary method of collecting 

statistics 37. 
Se<^uence of averages 91. 
Senes 

continuous 59, 61. 
discrete 59, 61. 
Short-cut method 

for arithmetic average 74. 

proof of 74. 
for standard deviation 85. 
proof of 85. 
Short-time changes 106-107. 
correlation of 114-115. 
elimination of 108. 
Skewness 

Coefficients of 92. 
first 93. 
second 94. 
third 95. 
Effect of 91. 
Explanation of 90. 
Measures of 
first 93. 
second 94. 
third, 95. 
Smoothing 

frequency graphs 61. 
Source of statistics 20. 
Squares and sc[uare root 

accuracy in 44. 
Standard Deviation 
characteristics 86. 
computation 84. 
short-cut method 85. 

proof of 85. 
uses 86, 112. 
Standard of accuracy 40. 
Statistical method 21. 

history 10. 
Statistical regularity 15. 
Statistics 

applied 10. 
collection of 20. 
definition of 12. 
descriptive 10. 
distrust of 17. 
limitations of 19. 



Statistics 

necessity of 13. 

phases of 21. 

shortcomings 17. 

sources of 20. 

units 24. 

uses of 14. 
Subject Il2. 

definition 111. 

designation of 120. 

in correlation table 124. 

in Galton graph 121-2. 
Stissmilch, Johann Peter 6. 
Symmetrical histogram 

averages coincident 91. 

Tables 

accuracy of 52. 

anal3''sis of 52. 

correlation 124. 

form of 51. 

frequency 57. 

title of 52. 
Tabulation 

percentages 52. 

rules for 49. 
Title 

of table 50. 
Types, uses 66. 
Trend 

computation 97. 

definition 97. 

uses 97, 106-8, 114, 123. 

Units 

characteristics of 24. 
definition of 24. 
examples of 24. 
selection of 24. 

Variables 

comparison of 102, 104, 
10^110, 117, 119, 120. 
definition of 57. 
Variation 

Historical 57. 
Ratio of 

computation of 120-1. 
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Variation, Ratio of 
defined 119. 
proportional 122. 
Vital Statistics 

Life insurance 6. 
Reformation period 6. 



Weighted arithmetic average 
for consumer's index 101. 
^ value of 78. 

Weights 

efifect of 78. 
rule for 78. 
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Weighted arithmetic average 
definition 77. 
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